
Skeleton-of-Thought Prompting

Last updated on September 4, 2024 by Bhuwan Bhatt
Takeaways
  • Skeleton-of-Thought (SoT) prompting enhances response generation by first creating a basic structure (skeleton) and then expanding it in parallel, reducing latency.
  • Two-stage process: SoT divides generation into a skeleton phase followed by a detailed expansion phase, improving efficiency and speed.
  • Faster inference: SoT delivers over 2x speed improvement on 8 out of 12 models, making it ideal for real-time applications.
  • Quality improvement: In 60% of cases, SoT generates answers with quality equal to or better than traditional methods.
  • Limitations include higher token usage costs and potential quality issues when points in the skeleton are interdependent.

What is Skeleton-of-Thought Prompting?

Most state-of-the-art large language models (LLMs) rely on sequential decoding, which can lead to high latency. In contrast, humans approach problem-solving by first creating an outline or skeleton of their answer, then filling in details and supporting evidence.

Skeleton-of-Thought (SoT) prompting1 mimics this parallel process. It first instructs the LLM to generate a basic answer structure (the skeleton), and then expands on each point to create a detailed response. To optimize for speed, the detailed generation phase uses parallel API calls or batched decoding, reducing latency compared to traditional methods.

How to Use Skeleton-of-Thought Prompting?

SoT generates answers in two stages:

  • Skeleton stage
  • Point-expanding stage

Skeleton Stage

In the skeleton stage, SoT utilizes the skeleton prompt template to generate a skeleton answer.

Prompt

[User:] You're an organizer responsible for only giving the skeleton (not the full content) for answering the question. Provide the skeleton in a list of points (numbered 1., 2., 3., etc.) to answer the question. Instead of writing a full sentence, each skeleton point should be very short with only 3~5 words. Generally, the skeleton should have 3~10 points. Now, please provide the skeleton for the following question.

{question}

Skeleton:

[Assistant:] 1.

This prompt template can be fed directly to the LLM to obtain the skeleton answer.

Let's use SoT to generate tips for reducing carbon emissions on a personal level, as sketched below.
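Below is a minimal sketch of the skeleton stage in Python, assuming the OpenAI Python client (openai>=1.0); the model name and the get_skeleton helper are illustrative choices, not part of the original paper.

```python
import re

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Skeleton prompt template from the paper, reproduced from above.
SKELETON_TEMPLATE = (
    "You're an organizer responsible for only giving the skeleton (not the "
    "full content) for answering the question. Provide the skeleton in a list "
    "of points (numbered 1., 2., 3., etc.) to answer the question. Instead of "
    "writing a full sentence, each skeleton point should be very short with "
    "only 3~5 words. Generally, the skeleton should have 3~10 points. Now, "
    "please provide the skeleton for the following question.\n"
    "{question}\n"
    "Skeleton:"
)

def get_skeleton(question: str) -> list[str]:
    """Run the skeleton stage and return the numbered points as a list."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "user", "content": SKELETON_TEMPLATE.format(question=question)}
        ],
    )
    text = response.choices[0].message.content
    # Recover the individual points from the "1. ...", "2. ..." list.
    return re.findall(r"\d+\.\s*(.+)", text)

skeleton = get_skeleton("How can I reduce my carbon emissions on a personal level?")
print(skeleton)  # e.g. ['Use public transport', 'Eat less meat', ...]
```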

Point-Expanding Stage

In this stage, SoT uses the point-expanding prompt template to expand each point of the skeleton generated in the previous stage.

Prompt

[User:] You're responsible for continuing the writing of one and only one point in the overall answer to the following question.

{question}

The skeleton of the answer is

{skeleton}

Continue and only continue the writing of point {point index}. Write it very shortly in 1~2 sentence and do not continue with other points!

[Assistant:] {point index}.{Point skeleton}

The LLM is given the skeleton from the previous stage and asked to expand one point at a time. This is repeated for every point in the skeleton, and the process can be parallelized to speed up inference:

  • For LLMs with only API access, multiple parallel API requests can be sent to the provider.
  • For LLMs running locally, inference can be optimized by performing the operations in batch.

Now, let's use the point-expanding prompt to expand our previous skeleton.
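The sketch below shows one way to parallelize this stage with concurrent API requests. It continues the hypothetical helpers from the skeleton example above, and the thread-pool setup is an illustrative choice rather than the paper's reference implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Point-expanding prompt template from the paper, reproduced from above.
POINT_EXPANDING_TEMPLATE = (
    "You're responsible for continuing the writing of one and only one point "
    "in the overall answer to the following question.\n"
    "{question}\n"
    "The skeleton of the answer is\n"
    "{skeleton}\n"
    "Continue and only continue the writing of point {point_index}. Write it "
    "very shortly in 1~2 sentence and do not continue with other points!"
)

def expand_point(question: str, skeleton_text: str, point_index: int) -> str:
    """Expand a single skeleton point into a 1-2 sentence answer."""
    prompt = POINT_EXPANDING_TEMPLATE.format(
        question=question, skeleton=skeleton_text, point_index=point_index
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = "How can I reduce my carbon emissions on a personal level?"
skeleton_text = "\n".join(f"{i}. {p}" for i, p in enumerate(skeleton, start=1))

# Each point-expanding request is independent, so they can run concurrently;
# this independence is what gives SoT its latency advantage.
with ThreadPoolExecutor() as pool:
    expansions = list(
        pool.map(
            lambda i: expand_point(question, skeleton_text, i),
            range(1, len(skeleton) + 1),
        )
    )

print("\n".join(expansions))
```

For locally hosted models, the same loop could instead be replaced by a single batched decoding call, as noted in the list above.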

What Are the Results of Skeleton-of-Thought Prompting?

  • On 8 out of 12 models, SoT obtains a speed-up of at least 2x.

Speed-up gained after employing SoT (Ning et al.)

  • The quality of answers generated by SoT is comparable to or better than that of normal generation in 60% of cases.

Quality evaluation across two metrics: FastChat and LLMZoo (Ning et al.)

Limitations of Skeleton-of-Thought Prompting

  • Answer quality was evaluated with GPT-4 judges, without any involvement from human experts, so the quality evaluation isn't definitive.
  • SoT doesn't consider dependencies between points in the skeleton. As a result, when there is interdependence between points in the skeleton, the generated detailed answer may not be comprehensive.
  • LLMs available via API are billed by token usage. SoT increases total token usage, since the question and the full skeleton are repeated in every point-expanding request, which raises costs.

Conclusion

Through parallelization, SoT can boost the inference speed of a model by more than 2x. It is also easy to adopt, requiring only a few simple modifications to any prompt. However, the quality of the generated response may not be optimal, so humans should evaluate the outputs before deciding to use SoT in a production environment.

Footnotes

  1. Xuefei Ning et al. (2023). Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation.
