Last updated on October 3, 2024
DiVeRSe (Diverse Verifier on Reasoning Steps) (Li et al., 2023) is a method designed to enhance the reasoning abilities of Large Language Models (LLMs) by improving the way they handle multi-step problems.
LLMs still struggle with complex tasks like arithmetic word problems. DiVeRSe tackles this by adding three major components:
Diverse Prompts: DiVeRSe generates multiple reasoning paths by sampling from different prompts. It randomly selects $M_1$ different prompts for each question, and then samples $M_2$ reasoning paths for each prompt using sampling decoding. This way, you obtain $M = M_1 \times M_2$ diverse reasoning paths for each question.
Voting Verifier: Once the model has generated several reasoning paths, the voting verifier comes into play. It evaluates each reasoning path, scoring how likely it is to be correct. This is done using a pre-trained model which takes into account both the question and the reasoning steps. The verifier guides a voting mechanism, weighting paths based on their probability of being correct rather than simply counting how many paths lead to a specific answer.
Step-Aware Verification: A major innovation of DiVeRSe is its step-aware verifier, which checks the correctness of each individual step in the reasoning chain. Often, some steps may be correct while others are wrong, leading to an incorrect final answer. DiVeRSe identifies these mistakes by labeling each step and comparing it to known correct reasoning patterns. This helps improve the overall reasoning process by pinpointing where the error occurs and correcting it.
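The diverse-prompts component can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names `build_prompts` and `sample_paths`, the exemplar pool, the number of exemplars per prompt `k`, and the `sampler` callable (a stand-in for an LLM call with sampling decoding) are all assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple


def build_prompts(exemplars: List[str], question: str, m1: int, k: int,
                  rng: random.Random) -> List[str]:
    """Build m1 few-shot prompts, each from k randomly drawn exemplars (hypothetical helper)."""
    prompts = []
    for _ in range(m1):
        shots = rng.sample(exemplars, k)  # random subset, drawn independently per prompt
        prompts.append("\n\n".join(shots) + f"\n\nQ: {question}\nA:")
    return prompts


def sample_paths(prompts: List[str], sampler: Callable[[str], str],
                 m2: int) -> List[Tuple[str, str]]:
    """Draw m2 reasoning paths per prompt, giving len(prompts) * m2 paths in total."""
    return [(prompt, sampler(prompt)) for prompt in prompts for _ in range(m2)]


def stub_sampler(prompt: str) -> str:
    """Placeholder for an LLM call with temperature > 0 (assumption, not a real API)."""
    return "16 - 3 = 13, 13 - 4 = 9, 9 * 2 = 18. The answer is 18."


rng = random.Random(0)
pool = [f"Q: example question {i}\nA: example reasoning {i}" for i in range(8)]
prompts = build_prompts(pool, "How much money does Janet make per day?", m1=3, k=2, rng=rng)
paths = sample_paths(prompts, stub_sampler, m2=4)
print(len(paths))  # M = M_1 * M_2 = 3 * 4 = 12
```

In practice the stub would be replaced by a real model call; the point here is only the $M_1 \times M_2$ structure of the sampled paths.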
DiVeRSe can be applied to a range of reasoning tasks, especially those that require step-by-step logic. Here’s how to use it on a math problem.
Sample multiple reasoning paths for a given question by generating different prompts.
Q: Janet’s ducks lay 16 eggs per day. She eats 3 for breakfast every morning and uses 4 eggs for baking muffins. She sells the remaining eggs for $2 each. How much money does she make per day?
A:
Generated Reasoning Paths:
[Sample 1] 16 - 3 = 13 eggs left, 13 - 4 = 9 eggs left. She sells 9 eggs for $2 each, so 9 * 2 = $18.
[Sample 2] 16 - 3 = 13 eggs, 13 - 4 = 9 eggs, 9 eggs sold for $2 each, so $18.
Use the verifier to score each path based on its likelihood of being correct.
- Path 1: 91.2% correct.
- Path 2: 88.5% correct.
Apply step-aware verification to check the correctness of individual reasoning steps.
- Step 1: Correct subtraction (16 - 3 = 13).
- Step 2: Correct subtraction (13 - 4 = 9).
- Step 3: Correct multiplication (9 * 2 = 18).
Use weighted voting to arrive at the final answer, selecting the most likely correct answer based on the verified reasoning paths.
Final Answer: $18.
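The weighted voting step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `weighted_vote` is hypothetical, verifier scores are treated as plain floats, and the second set of paths is invented purely to show how weighted voting can differ from simple majority voting.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(paths: List[Tuple[str, float]]) -> str:
    """Sum verifier scores per final answer and return the answer with the largest total."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    return max(totals, key=totals.get)


# Scores from the example above: both paths agree on 18.
example_paths = [("18", 0.912), ("18", 0.885)]
print(weighted_vote(example_paths))  # 18

# Hypothetical scores (not from the source): two low-confidence paths agree on a
# wrong answer, but a single high-confidence path outweighs them (0.95 > 0.65).
hypothetical_paths = [("18", 0.95), ("14", 0.30), ("14", 0.35)]
print(weighted_vote(hypothetical_paths))  # 18
```

With plain majority voting the second case would return 14, since two of the three paths reach that answer; weighting by verifier score is what lets the single well-supported path win.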
Access the open-source code.
DiVeRSe was evaluated on several reasoning tasks, including arithmetic reasoning (e.g., GSM8K, MultiArith), commonsense reasoning (e.g., CommonsenseQA), and inductive reasoning (e.g., CLUTRR). The method achieved state-of-the-art results on many of these benchmarks, outperforming previous approaches like self-consistency and greedy decoding.
| Task | Previous SOTA | Self-Consistency | DiVeRSe |
|---|---|---|---|
| GSM8K | 74.4% | 76.7% | 82.3% |
| AsDiv | 81.9% | 86.2% | 88.7% |
| MultiArith | 99.3% | 98.6% | 99.8% |
| SVAMP | 86.6% | 85.8% | 87.0% |
| SingleEq | 79.5% | 93.7% | 94.9% |
| CLUTRR | 67.0% | 35.6% | 95.9% |
DiVeRSe offers a powerful method to enhance the reasoning abilities of large language models by leveraging diverse prompts, verifier-based scoring, and step-aware verification. This approach not only improves overall accuracy but also provides finer control over the reasoning process, allowing for more reliable and interpretable results. As LLMs continue to evolve, DiVeRSe represents a step forward in making these models more capable and trustworthy in complex reasoning tasks.
Li, Y., Lin, Z., Zhang, S., Fu, Q., Chen, B., Lou, J.-G., & Chen, W. (2023). Making Large Language Models Better Reasoners with Step-Aware Verifier. https://arxiv.org/abs/2206.02336