Last updated on November 8, 2024
Universal Self-Consistency (USC) is a prompting technique used to refine and improve the accuracy of answers generated by a Large Language Model (LLM). It collects multiple responses the model has generated for the same prompt and then prompts the model to choose the best answer from among them.
USC builds on the concept of self-consistency, which uses multiple reasoning paths to find the most common response as a way to improve prediction confidence. Unlike standard self-consistency, which requires exact answers (like numbers) to tally votes, USC extends this approach to free-form responses by having the LLM select the most internally consistent answer from multiple generated outputs.
USC enhances traditional self-consistency by supporting free-form answers, which is essential for tasks like summarization, open-ended Q&A, and code generation. Where previous methods required the extraction of identical answers, USC leverages LLMs to find internal consistency, making it more adaptable and reliable for diverse tasks.
Method | USC | Standard Self-Consistency | Execution-Based Self-Consistency |
---|---|---|---|
Output Requirement | Free-form or structured | Structured answers (e.g., single values) | Structured answers with execution results |
Selection Approach | Consistency-based LLM selection | Answer extraction with majority vote | Code execution to find matching outputs |
Applications | Open-ended Q&A, summarization, code generation | Math, logic, closed-form Q&A | Code generation |
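To see why standard self-consistency needs extractable answers, here is a minimal sketch of majority voting. The function name `majority_vote` and the sample answers are illustrative assumptions, not taken from the USC paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and its vote count (standard self-consistency)."""
    # Light normalization so "18" and " 18 " count as the same vote.
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count

# Closed-form answers: identical strings can be tallied directly.
numeric_answers = ["18", "18", "17", "18", "20"]  # illustrative values
numeric_winner, numeric_votes = majority_vote(numeric_answers)

# Free-form answers: every string is unique, so each gets exactly one vote
# and the "winner" carries no signal. This is the gap USC is meant to fill.
free_form_answers = [
    "Jupiter is the largest planet.",
    "The largest planet is Jupiter.",
    "It's Jupiter, by a wide margin.",
]
free_winner, free_votes = majority_vote(free_form_answers)

print(numeric_winner, numeric_votes)
print(free_votes)
```

All three free-form answers agree in meaning, yet exact-match voting cannot see that, which is why USC hands the comparison to the LLM instead.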
[Prompted question or task]
I have generated the following responses to the question: [Prompted question or task]
Response 1: [Response 1]
Response 2: [Response 2]
Response 3: [Response 3]
...
Evaluate these responses. Select the most consistent response based on majority consensus. Start your answer with "The most consistent response is Response X" (without quotes).
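The template above can be filled in programmatically. The following is a minimal sketch; the helper name `build_usc_prompt` and the two sample responses are assumptions made for illustration.

```python
def build_usc_prompt(question, responses):
    """Assemble the USC selection prompt from a question and candidate responses."""
    header = f"I have generated the following responses to the question: {question}"
    # Number the candidates from 1 so they match the "Response X" wording.
    numbered = [f"Response {i}: {text}" for i, text in enumerate(responses, start=1)]
    instruction = (
        "Evaluate these responses. Select the most consistent response based on "
        "majority consensus. Start your answer with "
        '"The most consistent response is Response X" (without quotes).'
    )
    return "\n\n".join([header, "\n".join(numbered), instruction])

prompt = build_usc_prompt(
    "Which planet in our solar system is the largest?",
    ["Jupiter is the largest planet.", "The largest planet is Jupiter."],
)
print(prompt)
```

Asking for a fixed opening phrase makes the model's choice easy to extract from its reply.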
USC is particularly useful when you need accurate answers from an LLM but don’t have predefined answer structures or a way to validate answers externally.
Suppose we want to know the largest planet in our solar system. We start by generating multiple responses:
Which planet in our solar system is the largest?
Our solar system consists of eight planets. The planets in order of size are: Jupiter, Saturn, Uranus, Neptune, Earth, Venus, Mars, and Mercury.
Among these, Jupiter is the largest. So, the largest planet in our solar system is Jupiter.
Let's do this 5 times. These are our outputs:
In step 2, we combine all the responses into a single prompt and ask the LLM to choose the most consistent one.
I have generated the following responses to the question: Which planet in our solar system is the largest?
[list of the responses]
Evaluate these responses. Select the most consistent response based on majority consensus. Start your answer with "The most consistent response is Response X" (without quotes).
The most consistent response is Response 4.
As a reminder, Response 4 was:
Our solar system has eight planets. The planets in order from the sun are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. If we compare the sizes of these planets, we find that Jupiter is significantly larger than the others. In fact, Jupiter is so large that over 1,300 Earths could fit inside it. Therefore, the largest planet in our solar system is Jupiter.
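The two steps of this walkthrough can be sketched end to end. This is only an illustration under stated assumptions: `generate` and `select` are hypothetical callables standing in for whatever LLM client you use, the stub outputs are placeholders rather than real model responses, and falling back to the first sample when the reply cannot be parsed is a design choice of this sketch, not part of USC itself.

```python
import re

def parse_usc_choice(reply, num_responses):
    """Return the 0-based index of the chosen response, or None if unparseable."""
    match = re.search(r"most consistent response is Response\s+(\d+)", reply, re.IGNORECASE)
    if match is None:
        return None
    index = int(match.group(1)) - 1
    return index if 0 <= index < num_responses else None

def universal_self_consistency(question, generate, select, n=5):
    """Sample n responses, ask the model to pick one, and return the pick.

    `generate(question)` and `select(prompt)` are hypothetical callables that
    wrap an LLM call and return its text output.
    """
    # Step 1: sample several responses to the same question.
    responses = [generate(question) for _ in range(n)]
    # Step 2: put them into one prompt and ask for the most consistent one.
    listing = "\n".join(f"Response {i}: {r}" for i, r in enumerate(responses, start=1))
    prompt = (
        f"I have generated the following responses to the question: {question}\n\n"
        f"{listing}\n\n"
        "Evaluate these responses. Select the most consistent response based on "
        "majority consensus. Start your answer with "
        '"The most consistent response is Response X" (without quotes).'
    )
    choice = parse_usc_choice(select(prompt), len(responses))
    # Assumption: fall back to the first sample if the reply ignores the format.
    return responses[choice if choice is not None else 0]

# Stubbed model calls so the sketch runs without any API access.
stub_outputs = iter(f"stub response {k}" for k in range(1, 6))
result = universal_self_consistency(
    "Which planet in our solar system is the largest?",
    generate=lambda question: next(stub_outputs),
    select=lambda prompt: "The most consistent response is Response 4.",
)
print(result)
```

In practice, `generate` would typically sample with a non-zero temperature so that the candidate responses differ from one another.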
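USC was tested across multiple benchmarks. On tasks where answers can be extracted for voting (math and code generation), it performs on par with standard self-consistency; on open-ended tasks where standard self-consistency does not apply, it improves over greedy decoding and random selection. Below are results from common benchmark tasks showing USC’s effectiveness.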
Task | Greedy Decoding | Random Selection | Standard Self-Consistency | USC |
---|---|---|---|---|
Math (GSM8K) | 85.7% | 82.9% | 90.4% | 90.2% |
Code Generation (ARCADE) | 26.0% | 26.8% | 30.3% | 30.1% |
Summarization (GovReport) | ROUGE-1: 38.8 | ROUGE-1: 38.5 | Not Applicable | ROUGE-1: 40.2 |
TruthfulQA (Open Q&A) | 62.1% | 62.9% | Not Applicable | 67.7% (truthfulness) |
These results highlight USC’s capacity to significantly improve LLM-generated outputs on open-ended tasks where answer extraction for voting is difficult or not feasible.
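To make the size of these improvements concrete, the short sketch below computes USC's gain over greedy decoding for each row. The numbers are copied from the table above; the dictionary layout and labels are assumptions of this sketch.

```python
# Scores copied from the table above as (greedy decoding, USC).
results = {
    "GSM8K (accuracy %)": (85.7, 90.2),
    "ARCADE (accuracy %)": (26.0, 30.1),
    "GovReport (ROUGE-1)": (38.8, 40.2),
    "TruthfulQA (truthfulness %)": (62.1, 67.7),
}

# Round to one decimal place to avoid floating-point noise in the differences.
gains = {task: round(usc - greedy, 1) for task, (greedy, usc) in results.items()}

for task, gain in gains.items():
    print(f"{task}: +{gain} over greedy decoding")
```

The gains range from 1.4 ROUGE-1 points on GovReport to 5.6 percentage points on TruthfulQA.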
Universal Self-Consistency is a powerful, intuitive method for improving the accuracy and reliability of LLM responses to a given prompt: it compiles multiple responses and lets the model itself decide which is best. While sampling several responses can be time-consuming, the method doesn't require many additional resources and can be highly rewarding, particularly for prompts that involve free-form writing, such as essays.
Chen, X., Aksitov, R., Alon, U., Ren, J., Xiao, K., Yin, P., Prakash, S., Sutton, C., Wang, X., & Zhou, D. (2023). Universal Self-Consistency for Large Language Model Generation. https://arxiv.org/abs/2311.17311