Chain-of-Thought (CoT) prompting^{1} helps large language models (LLMs) simulate the human thinking process by constructing intermediate reasoning steps before stating a conclusion. But when a task requires multiple reasoning steps, a small mistake in an early step can propagate to later steps and produce a wrong answer, and CoT has no error-correction mechanism. Some methods^{2} mitigate this issue by training a separate verifier to evaluate the accuracy of the generated response, but training such a verifier requires a large amount of human-labeled, task-specific data as well as substantial computing resources.
Humans can verify their own answers by using the conclusion to predict an original condition given in the question. If that condition can be derived from the conclusion, the answer is likely correct. Self-Verification prompting^{3} mimics this behavior: it evaluates the correctness of a generated response by using that response to predict the conditions in the original context.
The Self-Verification process consists of two steps:
Forward reasoning: The LLM generates multiple candidate answers with CoT prompting, using sampling decoding so that the candidates differ from one another.
Backward verification: Each candidate answer from the previous step is verified, and the answer that gets the most votes, i.e., the one that most often correctly predicts the original condition given its conclusion, is chosen as the final answer.
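The two steps above can be sketched as a small orchestration function. This is a minimal sketch, not the authors' implementation: `generate_candidates` and `verification_score` are hypothetical callables standing in for the LLM calls, and the tie-breaking rule (first candidate wins) is an assumption.

```python
from typing import Callable, List


def self_verify(
    question: str,
    generate_candidates: Callable[[str, int], List[str]],
    verification_score: Callable[[str, str], int],
    num_candidates: int = 5,
) -> str:
    """Return the candidate answer with the highest verification score."""
    # Step 1 (forward reasoning): sample several CoT answers.
    candidates = generate_candidates(question, num_candidates)
    # Drop duplicate answers while keeping their original order.
    unique = list(dict.fromkeys(candidates))
    # Step 2 (backward verification): score every distinct candidate.
    scores = {c: verification_score(question, c) for c in unique}
    # max() returns the first candidate on ties (an assumption of this sketch).
    return max(unique, key=lambda c: scores[c])
```

In practice, both callables would wrap prompts sent to the same LLM, so no separately trained verifier is involved.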
Let's use Self-Verification prompting to generate an answer for the following question:
Jackie has 10 apples. Adam has 8 apples. How many more apples does Jackie have than Adam?
Parameter | Supported values | Use |
---|---|---|
Temperature | Floating-point number in the range 0.0 (same as greedy decoding) to 2.0 (maximum creativity) | Higher values lead to greater variability |
Top K | Integer in the range 1 to 100 | Higher values lead to greater variability |
Top P | Floating-point number in the range 0.0 to 1.0 | Unless you change the value, this setting is not used |
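To make the three sampling parameters concrete, here is a rough pure-Python sketch of how they might be applied to a toy set of next-token scores. The function name `sample_token`, the dictionary of logits, and the order of the filters (top-k first, then top-p) are assumptions made for illustration; real inference libraries may differ in the details.

```python
import math
import random
from typing import Dict, Optional


def sample_token(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,  # None means the setting is not used
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token from toy logits using temperature, top-k and top-p."""
    rng = rng or random.Random()
    # Temperature 0.0 is treated as greedy decoding: pick the best token.
    if temperature == 0.0:
        return max(logits, key=logits.get)
    # Rank tokens from most to least likely.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    # Top K: keep only the k highest-scoring tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Softmax with temperature (higher temperature flattens the distribution).
    max_logit = ranked[0][1]
    weights = [math.exp((score - max_logit) / temperature) for _, score in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top P: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        cumulative = 0.0
        cutoff = len(probs)
        for i, p in enumerate(probs):
            cumulative += p
            if cumulative >= top_p:
                cutoff = i + 1
                break
        ranked, probs = ranked[:cutoff], probs[:cutoff]
    tokens = [token for token, _ in ranked]
    return rng.choices(tokens, weights=probs, k=1)[0]
```

For forward reasoning, a temperature above 0.0 is what allows repeated calls to return different candidate answers.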
Answer A1:
Answer A2:
The backward verification step consists of several sub-steps. Let's go through each of them:
Rewrite the original question together with the candidate answer in declarative form. You can use the following prompt template:
Please change the questions and answers into complete declarative sentences [q] The answer is [y].
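Filling in the template can be sketched with plain string substitution. The helper name `build_rewrite_prompt` is hypothetical, and the candidate answer "2" is used here only as an example value.

```python
REWRITE_TEMPLATE = (
    "Please change the questions and answers into complete declarative "
    "sentences [q] The answer is [y]."
)


def build_rewrite_prompt(question: str, answer: str) -> str:
    """Insert the question and a candidate answer into the rewrite template."""
    return REWRITE_TEMPLATE.replace("[q]", question).replace("[y]", answer)


question = (
    "Jackie has 10 apples. Adam has 8 apples. "
    "How many more apples does Jackie have than Adam?"
)
prompt = build_rewrite_prompt(question, "2")
```

The resulting `prompt` string is what would be sent to the LLM once per candidate answer.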
Declarative form for Answer A1:
Declarative form for Answer A2:
Mask one of the conditions in the declarative sentence and prepare new questions for the LLM. The new questions can be either true-false questions or questions asking the LLM to predict the masked value. Sample questions from the above declarative form:
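For arithmetic tasks, the masking step can be approximated by replacing one numeric condition at a time with X. The sketch below is only an illustration: the function name `mask_conditions`, the declarative sentence, and the wording of the appended question are assumptions, not the exact text used in the paper.

```python
import re
from typing import List


def mask_conditions(declarative: str, num_conditions: int) -> List[str]:
    """Mask each of the first `num_conditions` numbers in turn with X.

    The numbers after the first `num_conditions` (here, the conclusion)
    are left untouched so the LLM can reason backward from them.
    """
    matches = list(re.finditer(r"\d+(?:\.\d+)?", declarative))[:num_conditions]
    questions = []
    for match in matches:
        masked = declarative[: match.start()] + "X" + declarative[match.end():]
        # Hypothetical question wording appended to the masked statement.
        questions.append(masked + " What is the value of X?")
    return questions


# Hypothetical declarative rewrite of the apples question with answer 2.
declarative = "Jackie has 10 apples. Adam has 8 apples. Jackie has 2 more apples than Adam."
masked_questions = mask_conditions(declarative, num_conditions=2)
```

Each entry in `masked_questions` hides one original condition (10 or 8) while keeping the candidate conclusion visible.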
True-false questions are suitable for non-arithmetic tasks. We won't be using them in our demonstration, but they can be of the form:
Here, the latter rewritten condition correctly predicts the value of X more often, so the answer that corresponds to it, i.e., "A2: Jackie has 10 apples, so Jackie has 10-8=2 more apples than Adam, and the answer is 2.", is chosen as the correct answer.
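The comparison described above amounts to counting, for each candidate, how many times the LLM recovers the masked value. A minimal sketch follows; the function names are hypothetical and the predictions are invented purely for illustration, not actual model outputs.

```python
from typing import Dict, List


def verification_scores(
    predictions: Dict[str, List[str]], true_value: str
) -> Dict[str, int]:
    """Count how often each candidate's backward predictions match the masked value."""
    return {
        label: sum(pred.strip() == true_value for pred in preds)
        for label, preds in predictions.items()
    }


def pick_answer(scores: Dict[str, int]) -> str:
    """Return the label of the candidate with the highest score."""
    return max(scores, key=scores.get)


# Invented predictions of X when Jackie's "10 apples" condition is masked.
predictions = {
    "A1": ["12", "10", "9"],
    "A2": ["10", "10", "10"],
}
scores = verification_scores(predictions, true_value="10")
best = pick_answer(scores)
```

With these made-up predictions, A2 recovers the masked value every time, so it receives the higher score and is selected.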
Impact of the verification stage on accuracy^{3}
Just like humans, LLMs can verify their own answers. Self-Verification improves the accuracy and reliability of LLMs on reasoning tasks without requiring a separate verifier model trained on human-labeled data.
Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Shen, J., Yin, Y., Li, L., Shang, L., Jiang, X., Zhang, M., & Liu, Q. (2021). Generate & Rank: A Multi-task Framework for Math Word Problems. https://arxiv.org/abs/2109.03034
Weng, Y., et al. (2022). Large Language Models are Better Reasoners with Self-Verification. https://arxiv.org/abs/2212.09561