
Skeleton-of-Thought Prompting

Last updated on September 4, 2024 by Bhuwan Bhatt
Takeaways
  • Skeleton-of-Thought (SoT) prompting enhances response generation by first creating a basic structure (skeleton) and then expanding it in parallel, reducing latency.
  • Two-stage process: SoT divides generation into a skeleton phase followed by a detailed expansion phase, improving efficiency and speed.
  • Faster inference: SoT delivers over 2x speed improvement on 8 out of 12 models, making it ideal for real-time applications.
  • Quality improvement: In 60% of cases, SoT generates answers with quality equal to or better than traditional methods.
  • Limitations include higher token usage costs and potential quality issues when points in the skeleton are interdependent.

What is Skeleton-of-Thought Prompting?

Most state-of-the-art large language models (LLMs) rely on sequential decoding, which can lead to high latency. In contrast, humans approach problem-solving by first creating an outline or skeleton of their answer, then filling in details and supporting evidence.

Skeleton-of-Thought (SoT) prompting1 mimics this parallel process. It first instructs the LLM to generate a basic answer structure (the skeleton), and then expands on each point to create a detailed response. To optimize for speed, the detailed generation phase uses parallel API calls or batched decoding, reducing latency compared to traditional methods.

How to Use Skeleton-of-Thought Prompting?

SoT generates answers in two stages:

  • Skeleton stage
  • Point-expanding stage

Skeleton Stage

In the skeleton stage, SoT utilizes the skeleton prompt template to generate a skeleton answer.

Prompt

[User:] You're an organizer responsible for only giving the skeleton (not the full content) for answering the question. Provide the skeleton in a list of points (numbered 1., 2., 3., etc.) to answer the question. Instead of writing a full sentence, each skeleton point should be very short with only 3~5 words. Generally, the skeleton should have 3~10 points. Now, please provide the skeleton for the following question.

{question}

Skeleton:

[Assistant:] 1.

This prompt template can be fed directly to the LLM to obtain the skeleton answer.

Let's use SoT to generate tips for reducing carbon emissions on a personal level, as sketched below.
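Below is a minimal sketch of the skeleton stage in Python, assuming the OpenAI Python client (openai>=1.0); the model name and the get_skeleton helper are illustrative choices, not part of the original paper.

```python
import re

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Skeleton prompt template from the paper, reproduced from above.
SKELETON_TEMPLATE = (
    "You're an organizer responsible for only giving the skeleton (not the "
    "full content) for answering the question. Provide the skeleton in a list "
    "of points (numbered 1., 2., 3., etc.) to answer the question. Instead of "
    "writing a full sentence, each skeleton point should be very short with "
    "only 3~5 words. Generally, the skeleton should have 3~10 points. Now, "
    "please provide the skeleton for the following question.\n"
    "{question}\n"
    "Skeleton:"
)

def get_skeleton(question: str) -> list[str]:
    """Run the skeleton stage and return the numbered points as a list."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "user", "content": SKELETON_TEMPLATE.format(question=question)}
        ],
    )
    text = response.choices[0].message.content
    # Recover the individual points from the "1. ...", "2. ..." list.
    return re.findall(r"\d+\.\s*(.+)", text)

skeleton = get_skeleton("How can I reduce my carbon emissions on a personal level?")
print(skeleton)  # e.g. ['Use public transport', 'Eat less meat', ...]
```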

Point-Expanding Stage

In this stage, SoT uses the point-expanding prompt template to expand each point of the skeleton generated in the previous stage.

Prompt

[User:] You're responsible for continuing the writing of one and only one point in the overall answer to the following question.

{question}

The skeleton of the answer is

{skeleton}

Continue and only continue the writing of point {point index}. Write it very shortly in 1~2 sentence and do not continue with other points!

[Assistant:] {point index}.{Point skeleton}

The LLM is given the skeleton from the previous stage and asked to expand one point at a time. This is repeated for every point in the skeleton, and the process can be parallelized to speed up inference:

  • For LLMs with only API access, multiple parallel API requests can be sent to the provider.
  • For LLMs running locally, inference can be optimized by performing the operations in batch.

Now, let's use the point-expanding prompt to expand our previous skeleton.
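The sketch below shows one way to parallelize this stage with concurrent API requests. It continues the hypothetical helpers from the skeleton example above, and the thread-pool setup is an illustrative choice rather than the paper's reference implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Point-expanding prompt template from the paper, reproduced from above.
POINT_EXPANDING_TEMPLATE = (
    "You're responsible for continuing the writing of one and only one point "
    "in the overall answer to the following question.\n"
    "{question}\n"
    "The skeleton of the answer is\n"
    "{skeleton}\n"
    "Continue and only continue the writing of point {point_index}. Write it "
    "very shortly in 1~2 sentence and do not continue with other points!"
)

def expand_point(question: str, skeleton_text: str, point_index: int) -> str:
    """Expand a single skeleton point into a 1-2 sentence answer."""
    prompt = POINT_EXPANDING_TEMPLATE.format(
        question=question, skeleton=skeleton_text, point_index=point_index
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = "How can I reduce my carbon emissions on a personal level?"
skeleton_text = "\n".join(f"{i}. {p}" for i, p in enumerate(skeleton, start=1))

# Each point-expanding request is independent, so they can run concurrently;
# this independence is what gives SoT its latency advantage.
with ThreadPoolExecutor() as pool:
    expansions = list(
        pool.map(
            lambda i: expand_point(question, skeleton_text, i),
            range(1, len(skeleton) + 1),
        )
    )

print("\n".join(expansions))
```

For locally hosted models, the same loop could instead be replaced by a single batched decoding call, as noted in the list above.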

What Are the Results of Skeleton-of-Thought Prompting?

  • On 8 out of 12 models, SoT obtains a speed-up of at least 2x.

Speed-up gained after employing SoT (Ning et al.)

  • The quality of answers generated by SoT is comparable to or better than that of normal generation in 60% of cases.

Quality evaluation across two metrics: FastChat and LLMZoo (Ning et al.)

Limitations of Skeleton-of-Thought Prompting

  • Answer quality was evaluated with GPT-4 judges, without any involvement from human experts, so the quality evaluation isn't definitive.
  • SoT doesn't consider dependencies between points in the skeleton. As a result, when there is interdependence between points in the skeleton, the generated detailed answer may not be comprehensive.
  • LLMs available via API are billed by token usage. SoT increases total token usage, since the question and the full skeleton are repeated in every point-expanding request, which raises costs.

Conclusion

Through parallelization, SoT can boost the inference speed of a model by more than 2x. It is also easy to adopt, requiring only a few simple modifications to any prompt. However, the quality of the generated response may not be optimal, so humans should evaluate the outputs before deciding to use SoT in a production environment.

Footnotes

  1. Xuefei Ning et al. (2023). Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation.
