Chain-of-Thought (CoT) Prompting is a technique that enhances the reasoning capabilities of large language models (LLMs) by incorporating logical steps—or a “chain of thought”—within the prompt. Unlike direct-answer prompting, CoT guides the model to work through intermediate reasoning steps, making it more adept at solving complex tasks like math problems, commonsense reasoning, and symbolic manipulation.
Traditional prompts typically consist of simple input-output examples and lack explicit reasoning steps, making it challenging for models to infer the necessary logic for tasks requiring multi-step reasoning. CoT prompting addresses this by:
The example below illustrates the difference between few-shot prompting (left) and CoT prompting (right). While the traditional approach goes directly to the solution, CoT guides the model to lay out its reasoning process, often resulting in more accurate and interpretable outcomes.
The key idea of CoT is that showing the model a few examples (exemplars) in which the reasoning process is written out explicitly leads it to include reasoning steps in its own responses. This structured approach often produces more accurate outputs.
With CoT, the model essentially “talks through” its thought process, leading to more reliable answers.
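To make the contrast concrete, here is a minimal sketch of how a standard few-shot prompt and a CoT prompt could be assembled from the same exemplar. The `Exemplar` class and `build_prompt` function are hypothetical helpers, not part of any library, and the exemplar is modeled on the tennis-ball example from Wei et al. (2022); the exact wording is illustrative.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    reasoning: str  # intermediate steps, shown only in the CoT variant
    answer: str


def build_prompt(exemplars, question, with_reasoning):
    """Format few-shot exemplars followed by the new, unanswered question."""
    blocks = []
    for ex in exemplars:
        if with_reasoning:
            shown = f"{ex.reasoning} The answer is {ex.answer}."
        else:
            shown = f"The answer is {ex.answer}."
        blocks.append(f"Q: {ex.question}\nA: {shown}")
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    Exemplar(
        question=(
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        reasoning=(
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        answer="11",
    )
]
new_question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)

standard_prompt = build_prompt(exemplars, new_question, with_reasoning=False)
cot_prompt = build_prompt(exemplars, new_question, with_reasoning=True)
print(cot_prompt)
```

The only difference between the two prompts is whether the exemplar's answer includes the intermediate steps; the question being asked is identical in both.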
CoT prompting is especially valuable for tasks where structured reasoning is crucial:
Q: John has 10 apples. He gives away 4 and then receives 5 more. How many apples does he have?
A:
Q: [Your Question]
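As an illustration of how the template above might be filled in, the sketch below writes out the reasoning for the apples exemplar and appends a new question. The reasoning wording is an assumption (the answer line of the template is blank in this copy), and `apples_reasoning` and `fill_template` are hypothetical helper names.

```python
def apples_reasoning(start, given_away, received):
    """Spell out each arithmetic step, the way a CoT exemplar would."""
    after_giving = start - given_away
    total = after_giving + received
    steps = (
        f"John starts with {start} apples. "
        f"He gives away {given_away}, so {start} - {given_away} = {after_giving}. "
        f"He then receives {received} more, so {after_giving} + {received} = {total}. "
        f"The answer is {total}."
    )
    return steps, total


# Mirrors the Q/A layout of the template; {reasoning} and {question} are filled in below.
TEMPLATE = (
    "Q: John has 10 apples. He gives away 4 and then receives 5 more. "
    "How many apples does he have?\n"
    "A: {reasoning}\n"
    "Q: {question}\n"
    "A:"
)


def fill_template(question):
    """Insert the worked exemplar and the user's question into the template."""
    reasoning, _ = apples_reasoning(10, 4, 5)
    return TEMPLATE.format(reasoning=reasoning, question=question)


reasoning, total = apples_reasoning(10, 4, 5)
print(fill_template("A shelf holds 12 books. 3 are removed and 7 are added. How many books are on the shelf?"))
```

The exemplar answer works through 10 - 4 = 6 and then 6 + 5 = 11 before stating the result, which is the pattern the model is expected to imitate for the new question.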
Here are two demos illustrating how CoT prompting improves outcomes. The first demo shows GPT-3 (text-davinci-003) struggling with a word problem without CoT, while the second shows it succeeding with CoT.
Research has shown that CoT prompting can significantly improve LLM accuracy on arithmetic, commonsense, and symbolic reasoning tasks. For instance, PaLM 540B prompted with CoT exemplars achieved a 57% solve rate on GSM8K, a state-of-the-art (SOTA) result at the time.
The table below summarizes the performance improvements on key benchmarks when using CoT prompting:
Task | Model | Standard Prompting Accuracy | CoT Prompting Accuracy | Improvement (percentage points) |
---|---|---|---|---|
GSM8K (Math) | PaLM 540B | 55% | 74% | +19 |
SVAMP (Math) | PaLM 540B | 57% | 81% | +24 |
Commonsense (CSQA) | PaLM 540B | 76% | 80% | +4 |
Symbolic Reasoning | PaLM 540B | ~60% | ~95% | +35 |
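The Improvement column is the absolute difference between the two accuracy columns, measured in percentage points, which is not the same as a relative percentage gain. The short sketch below recomputes both from the table's figures; the `rows` list and `improvement` function are illustrative names, and the symbolic reasoning values are approximate, as in the table.

```python
# (task, standard prompting accuracy %, CoT prompting accuracy %) from the table above
rows = [
    ("GSM8K (Math)", 55, 74),
    ("SVAMP (Math)", 57, 81),
    ("Commonsense (CSQA)", 76, 80),
    ("Symbolic Reasoning", 60, 95),  # the table gives these two as approximate
]


def improvement(standard, cot):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = cot - standard
    relative = 100 * points / standard
    return points, round(relative, 1)


results = {task: improvement(std, cot) for task, std, cot in rows}
for task, (points, relative) in results.items():
    print(f"{task}: +{points} points ({relative}% relative to standard prompting)")
```

For GSM8K, for example, the gain is 74 - 55 = 19 percentage points, which corresponds to a relative gain of about 34.5% over the standard prompting accuracy.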
Importantly, according to the CoT authors, CoT only yields performance gains when used with models of roughly 100B parameters or more. Smaller models wrote illogical chains of thought, which led to worse accuracy than standard prompting. In general, the performance boost from CoT prompting grows with the size of the model.
Chain-of-Thought Prompting is a powerful method for unlocking reasoning capabilities in large language models. By encouraging step-by-step thinking, CoT prompting allows models to perform complex reasoning tasks effectively without needing additional training data. The benefits are particularly pronounced in large models (e.g., models with over 100 billion parameters), which exhibit improved reasoning capacities as they follow these structured reasoning prompts.
Chain-of-Thought prompting works by providing the model with examples of logical reasoning. When shown how to approach problems in a step-by-step way, the LLM is more likely to emulate this approach, resulting in responses that are both accurate and reliable.
CoT prompting is less effective with smaller models. Meaningful gains generally appear only with large models; smaller models may produce less coherent reasoning, and CoT can even lower their accuracy relative to standard prompting.
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., … Fiedel, N. (2022). PaLM: Scaling Language Modeling with Pathways.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training Verifiers to Solve Math Word Problems.