Most state-of-the-art Large Language Models (LLMs) rely on sequential decoding, which can lead to high latency. In contrast, humans approach problem-solving by first creating an outline or skeleton of their answer, then filling in details and supporting evidence.
Skeleton-of-Thought (SoT) prompting mimics this outline-first process. It first instructs the LLM to generate a basic structure for the answer (the skeleton) and then expands each skeleton point into a detailed response. To optimize for speed, the point-expansion phase uses parallel API calls or batched decoding, reducing end-to-end latency compared to sequential decoding.
SoT generates answers in two stages:
In the first stage, the skeleton stage, SoT uses the skeleton prompt template to generate a concise skeleton of the answer.
[User:] You're an organizer responsible for only giving the skeleton (not the full content) for answering the question. Provide the skeleton in a list of points (numbered 1., 2., 3., etc.) to answer the question. Instead of writing a full sentence, each skeleton point should be very short with only 3~5 words. Generally, the skeleton should have 3~10 points. Now, please provide the skeleton for the following question.
{question}
Skeleton:
[Assistant:] 1.
This prompt can be fed directly to the LLM to obtain the skeleton answer.
Let's use SoT to generate tips for reducing carbon emissions on a personal level.
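Below is a minimal sketch of the skeleton stage in Python. It assumes the OpenAI Python SDK with an `OPENAI_API_KEY` set in the environment; the model name and the exact wording of the question are illustrative choices, not prescribed by the paper.

```python
# Skeleton stage: ask the model for a short numbered outline only.
# Assumes the OpenAI Python SDK (pip install openai) and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

SKELETON_TEMPLATE = """\
You're an organizer responsible for only giving the skeleton (not the full \
content) for answering the question. Provide the skeleton in a list of points \
(numbered 1., 2., 3., etc.) to answer the question. Instead of writing a full \
sentence, each skeleton point should be very short with only 3~5 words. \
Generally, the skeleton should have 3~10 points. Now, please provide the \
skeleton for the following question.
{question}
Skeleton:"""

question = "How can I reduce my carbon emissions on a personal level?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "user", "content": SKELETON_TEMPLATE.format(question=question)}
    ],
)
skeleton = response.choices[0].message.content
print(skeleton)
# Illustrative output shape:
# 1. Reduce car usage
# 2. Eat less meat
# 3. Save household energy
# ...
```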
In the second stage, the point-expanding stage, SoT uses the point-expanding prompt template to expand each point of the skeleton generated in the previous stage.
[User:] You're responsible for continuing the writing of one and only one point in the overall answer to the following question.
{question}
The skeleton of the answer is
{skeleton}
Continue and only continue the writing of point {point index}. Write it very shortly in 1~2 sentence and do not continue with other points!
[Assistant:] {point index}. {point skeleton}
The LLM is given the skeleton generated in the previous stage and is asked to expand one point at a time. This is repeated for every point in the skeleton, and because each expansion is independent of the others, the requests can be issued in parallel to speed up inference.
Now, let's use the point-expanding prompt to expand our previous skeleton.
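Here is a sketch of the point-expanding stage, continuing the assumptions above (OpenAI Python SDK, illustrative model name). The helper names `POINT_TEMPLATE` and `expand_point` are ours, not the paper's, and the skeleton is hard-coded for clarity where in practice it would come from the first stage. Since each point is expanded by an independent request, the wall-clock latency of this stage is roughly that of the slowest single expansion rather than the sum of all of them.

```python
# Point-expanding stage: one independent API call per skeleton point,
# fanned out across a thread pool. Illustrative sketch, not the paper's code.
import re
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()

POINT_TEMPLATE = """\
You're responsible for continuing the writing of one and only one point in \
the overall answer to the following question.
{question}
The skeleton of the answer is
{skeleton}
Continue and only continue the writing of point {point_index}. Write it very \
shortly in 1~2 sentence and do not continue with other points!"""

question = "How can I reduce my carbon emissions on a personal level?"
# In practice this comes from the skeleton stage; hard-coded here for clarity.
skeleton = """1. Reduce car usage
2. Eat less meat
3. Save household energy"""


def expand_point(point_index: int) -> str:
    """Expand a single skeleton point with one independent API call."""
    prompt = POINT_TEMPLATE.format(
        question=question, skeleton=skeleton, point_index=point_index
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


# Collect the point indices ("1.", "2.", ...) from the skeleton lines.
indices = [int(n) for n in re.findall(r"^(\d+)\.", skeleton, flags=re.M)]

# Fan the expansions out in parallel; pool.map preserves skeleton order,
# so joining the results yields the final answer.
with ThreadPoolExecutor(max_workers=max(len(indices), 1)) as pool:
    expansions = list(pool.map(expand_point, indices))

print("\n".join(expansions))
```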
Figure: Speed-up gained after employing SoT (Ning et al.)
Figure: Quality evaluation across two metrics, FastChat and LLMZoo (Ning et al.)
SoT can boost a model's inference speed by more than 2× through parallelization, and it is easy to adopt, requiring only a few simple modifications to an existing prompt. However, the quality of the generated response may suffer, especially on questions that need step-by-step reasoning, so outputs should be evaluated by humans before SoT is used in a production environment.
Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., & Wang, Y. (2023). Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation.