Self-Verification Prompting
Last updated on August 19, 2024 by Bhuwan Bhatt
Takeaways
  • Error Correction: Self-Verification helps LLMs fix mistakes in multi-step reasoning by verifying conclusions against the original context.
  • Dual Process: It involves generating multiple answers and verifying them by checking if conclusions match the initial conditions.
  • Improved Performance: Self-Verification boosts accuracy in reasoning tasks, including commonsense reasoning, and enhances high-performing models like InstructGPT.

What is Self-Verification Prompting?

Chain-of-Thought (CoT) prompting1 helps Large Language Models (LLMs) simulate the human thinking process by constructing intermediate reasoning steps before stating a conclusion. But when a task requires multiple reasoning steps, a small mistake in an early step can propagate through the later steps and produce a wrong answer; CoT lacks an error-correction mechanism. Some methods2 mitigate this issue by training a separate verifier to evaluate the accuracy of generated responses, but training a verifier requires large amounts of human-labeled, task-specific data as well as computing resources.

Humans can self-verify an answer by using the conclusion to re-derive a condition stated in the original question: if the original condition can be recovered from the conclusion, the obtained answer is correct. Self-verification prompting3 mimics this human behavior and evaluates the correctness of a generated response by using that response to predict the conditions in the original context.

The self-verification process consists of two steps:

  • Forward reasoning: The LLM generates candidate answers with CoT prompting, using sampling decoding to produce multiple candidates.

  • Backward verification: Each candidate answer obtained in the previous step is verified, and the answer that most often correctly predicts the masked condition from its conclusion (i.e., receives the most votes) becomes the final answer.
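The two stages form a sample-then-vote loop. Below is a minimal Python sketch of that loop; `self_verify` and `toy_llm` are hypothetical stand-ins written for illustration, not part of the paper or any real model API:

```python
from collections import Counter
from itertools import cycle

def self_verify(question, condition_value, llm, n_candidates=3, p_checks=5):
    """Sketch of the two-stage Self-Verification loop.

    `llm` is a placeholder callable (prompt -> str) standing in for any
    real model API; `condition_value` is the condition we will mask.
    """
    # Forward reasoning: sample several chain-of-thought answers.
    candidates = [llm(f"{question} Let's think step by step.")
                  for _ in range(n_candidates)]

    # Backward verification: mask the condition, append the candidate
    # conclusion, and count how often the model recovers the masked value.
    scores = Counter()
    for ans in set(candidates):
        masked = (question.replace(condition_value, "X", 1)
                  + f" The answer is {ans}. What is the value of X?")
        scores[ans] = sum(llm(masked) == condition_value
                          for _ in range(p_checks))
    return scores.most_common(1)[0][0]

# Toy deterministic stand-in for a real LLM (purely illustrative).
_forward_answers = cycle(["18", "2", "2"])  # forward pass sometimes errs

def toy_llm(prompt):
    if "step by step" in prompt:
        return next(_forward_answers)
    if "value of X" in prompt:
        # Only the correct conclusion (2) lets the model re-derive X = 10.
        return "10" if "The answer is 2." in prompt else "16"
    return ""
```

With the toy model above, the wrong candidate "18" never recovers the masked condition, so the vote settles on "2".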

How to Use Self-Verification Prompting?

Let's use Self-Verification prompting to generate an answer for the following question:

Prompt:
Jackie has 10 apples. Adam has 8 apples. How many more apples does Jackie have than Adam?

Forward Reasoning

  • Generate sample answers using few-shot CoT and sampling decoding. You can vary the temperature, top-k, or top-p values to get a variety of answers.
| Parameter | Supported values | Use |
| --- | --- | --- |
| Temperature | Floating-point number from 0.0 (same as greedy decoding) to 2.0 (maximum creativity) | Higher values lead to greater variability |
| Top K | Integer from 1 to 100 | Higher values lead to greater variability |
| Top P | Floating-point number from 0.0 to 1.0 | Unless you change the value, this setting is not used |
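Temperature controls variability by rescaling the model's logits before they are converted to sampling probabilities. A minimal, library-free illustration of temperature-scaled softmax (the exact scaling a given provider applies may differ):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: more variability
```

A flatter distribution makes the sampled answers more varied, which is exactly what forward reasoning relies on to produce distinct candidates.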

Answer A1:

Jackie has 10 apples and Adam has 8 apples, so Jackie has 10+8=18 more apples than Adam, and the answer is 18.

Answer A2:

Jackie has 10 apples, so Jackie has 10-8=2 more apples than Adam, and the answer is 2.

Backward Verification

This step consists of multiple sub-steps. Let's go through each of them:

Rewritten Candidate Conclusion

Rewrite the original question with the candidate answer in declarative form. You can use the following prompt template:

Please change the questions and answers into complete declarative sentences [q] The answer is [y].

Declarative form for Answer A1:

Jackie has 10 apples, and Adam has 8 apples, so Jackie has 18 more apples than Adam.

Declarative form for Answer A2:

Jackie has 10 apples. Adam has 8 apples. Jackie has 2 more apples than Adam.
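Filling the template is plain string formatting; a small helper (hypothetical, the paper prescribes no specific code) might look like:

```python
def rewrite_prompt(question, answer):
    """Fill the paper's rewriting template: '[q] The answer is [y].'"""
    return ("Please change the questions and answers into complete "
            f"declarative sentences {question} The answer is {answer}.")

prompt = rewrite_prompt(
    "Jackie has 10 apples. Adam has 8 apples. "
    "How many more apples does Jackie have than Adam?",
    "2",
)
```

The filled prompt is then sent to the LLM, which returns the declarative sentence.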

Rewritten Condition / Condition Masking

  • Mask one of the conditions in each declarative sentence and prepare new questions for the LLM. The new questions can either be true-false questions or questions asking the LLM to predict the masked value. Sample questions derived from the declarative forms above:

    • Jackie has X apples, and Adam has 8 apples, so Jackie has 18 more apples than Adam. What is the value of X?
    • Jackie has X apples. Adam has 8 apples. Jackie has 2 more apples than Adam. What is the value of X?
  • True-false questions are suitable for non-arithmetic tasks. We won't use them in our demonstration, but they can take the form:

    • Adam has 8 apples, so Jackie has 18 more apples than Adam. Jackie has a total of 10 apples. Is this correct (True or False)?
    • Adam has 8 apples. Jackie has 2 more apples than Adam. Jackie has a total of 10 apples. Is this correct (True or False)?
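When the condition's value is known, masking can be scripted; a minimal sketch with a hypothetical `mask_condition` helper:

```python
def mask_condition(declarative, condition):
    """Mask one known condition value and ask the model to recover it."""
    assert condition in declarative, "condition must appear in the sentence"
    return declarative.replace(condition, "X", 1) + " What is the value of X?"

decl = ("Jackie has 10 apples. Adam has 8 apples. "
        "Jackie has 2 more apples than Adam.")
masked = mask_condition(decl, "10")
# -> "Jackie has X apples. Adam has 8 apples. Jackie has 2 more apples
#     than Adam. What is the value of X?"
```

Replacing only the first occurrence keeps any later mention of the same number (e.g. in the conclusion) intact.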

Verification

  • Finally, pass the rewritten conditions to the LLM and have it predict the masked value (or answer true/false). For each rewritten condition, repeat the process P times (say, 5) and increment that condition's score each time the LLM predicts the masked value correctly.
  • The answer corresponding to the condition with the most votes is the final answer.
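The voting step can be sketched as follows; the prediction lists here are illustrative placeholders, not real model outputs:

```python
def score_candidates(predictions, true_value):
    """Count, for each candidate answer, how many of its P masked-value
    predictions match the true condition; return the top-voted answer."""
    scores = {ans: sum(p == true_value for p in preds)
              for ans, preds in predictions.items()}
    return max(scores, key=scores.get), scores

# Hypothetical verification runs, P = 5 checks per candidate:
predictions = {
    "18": ["16", "16", "10", "16", "16"],  # wrong answer rarely recovers X
    "2":  ["10", "10", "10", "10", "8"],   # right answer usually recovers X
}
best, tally = score_candidates(predictions, "10")  # best == "2"
```

Ties are broken arbitrarily by `max`; in practice, increasing P reduces the chance of a tie.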

Here, the latter condition correctly predicts the value of X more often; hence, the answer corresponding to this rewritten condition, i.e., "A2: Jackie has 10 apples, so Jackie has 10-8=2 more apples than Adam, and the answer is 2.", is the final answer.

What Are the Results of Self-Verification Prompting?

  • Self-verification improves the performance of prior methods on all datasets, and it achieves new state-of-the-art (SOTA) performance on 6 of the 8 datasets.
  • Even high-performing forward-reasoning models like InstructGPT improve by an average of 2.33% with self-verification, implying that models with strong forward-reasoning capabilities also benefit from the mechanism.
  • The self-verification technique can also effectively improve the accuracy of commonsense reasoning models.

Impact of the verification stage on accuracy3

Limitations of Self-Verification Prompting

  • The effectiveness of self-verification relies on the model's ability to generate the correct answer as one of the candidate answers. As such, smaller language models may not benefit from this technique.
  • The method requires generating and verifying multiple candidate inference chains, which increases the computational cost of inference.

Conclusion

Just like humans, LLMs are capable of self-verifying their own answers without relying on an external trained model for answer verification. Self-verification improves the accuracy and reliability of LLMs in reasoning tasks without the need for separate models trained on human-labeled data.

Footnotes

  1. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903

  2. Shen, J., Yin, Y., Li, L., Shang, L., Jiang, X., Zhang, M., & Liu, Q. (2021). Generate & Rank: A Multi-task Framework for Math Word Problems. https://arxiv.org/abs/2109.03034

  3. Weng, Y., Zhu, M., Xia, F., Li, B., He, S., Liu, S., Sun, B., Liu, K., & Zhao, J. (2022). Large Language Models are Better Reasoners with Self-Verification. https://arxiv.org/abs/2212.09561

Copyright © 2024 Learn Prompting.