
🟢 Chain-of-Verification (CoVe) Prompting

Last updated on August 19, 2024 by Bhuwan Bhatt
Takeaways
  • Reduces hallucinations: CoVe minimizes factual errors in LLM responses by using verification questions.
  • Improved accuracy: CoVe outperforms zero-shot, few-shot, and Chain-of-Thought (CoT) methods in generating accurate responses.
  • Four-step process: It involves baseline response, verification planning, answering verification questions, and refining the final output.
  • Limitations: CoVe reduces but doesn't fully eliminate hallucinations, especially in reasoning steps.

What is CoVe Prompting?

Large Language Models (LLMs) often generate responses that look plausible but are factually incorrect. This is referred to as hallucination and remains an unresolved issue in LLMs. Hallucinations are more apparent for lesser-known facts, for facts that appear less frequently in the training corpus, and when generating long-form text (multiple sentences and paragraphs). For reasoning tasks, CoT1 improves response quality by encouraging the model to generate its reasoning before producing the final response. Chain-of-Verification2 (CoVe) follows a similar approach and uses verification questions to improve response quality.

Chain-of-Verification (CoVe) prompting2 is a self-critique technique that improves upon the initial response from the LLM by first planning verification questions to check its work and then systematically answering those questions to produce a refined response. The answers to these verification questions are usually shorter and easier to get right than the original long-form answer. The initial answer is then improved by providing the verification questions and their answers to the LLM as feedback.

Here is an example employing CoVe to refine the original answer:

  • Query (Prompt)

    Name some politicians who were born in NY, New York.

  • Baseline Response (AI Output)

    Here are some politicians who were born in NY, New York:
    Hillary Clinton - former secretary of state and former Democratic presidential nominee
    Donald Trump - former president of the United States
    Michael Bloomberg - former Mayor of New York City and former Democratic presidential candidate
    ...

  • Plan verifications (Prompt)

    Where was Hillary Clinton born?
    Where was Donald Trump born?
    Where was Michael Bloomberg born?
    ...

  • Execute verifications (AI Output)

    Hillary Clinton was born in Chicago, Illinois, United States, on October 26, 1947.
    Donald Trump was born on June 14, 1946, in Queens, New York City, New York, United States.
    Michael Bloomberg was born on February 14, 1942, in Boston, Massachusetts, United States.
    ...

  • Final verified response (AI Output)

    Here are some politicians who were born in NY, New York:
    Donald Trump - former president of the United States
    Alexandria Ocasio-Cortez - Democratic member of the U.S. House of Representatives
    ...

As seen in the example above, the individual verification questions are answered more accurately than in the original long-form answer.

How to Use CoVe Prompting?

Answer generation using CoVe is a 4-step process:

  1. Baseline response generation
    In this step, you prompt the model to generate a draft of the answer. The answer may not be factually correct.
  2. Planning verifications
    In this step, you generate a list of verification questions using the query and initial response.
  3. Verification execution
    This step prompts the LLM to generate a response for each of the verification questions.
  4. Final response generation
    In this step, you incorporate the answer from the previous step and generate a final response.
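The four steps above can be sketched as a single loop. Below is a minimal sketch in Python; `llm` is a hypothetical stand-in for any text-completion call (an API client, a local model, etc.), and all prompt wordings are illustrative rather than taken from the paper.

```python
def cove(query: str, llm) -> str:
    """Minimal Chain-of-Verification loop. `llm` is any callable
    mapping a prompt string to a response string (hypothetical)."""
    # Step 1: baseline response generation (may contain hallucinations)
    baseline = llm(f"Answer the question: {query}")

    # Step 2: plan verification questions from the query and the draft
    plan = llm(
        f"Question: {query}\nDraft answer: {baseline}\n"
        "List one verification question per line to fact-check the draft."
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # Step 3: answer each verification question independently, without
    # the draft in context, so the draft's errors cannot bias the checks
    evidence = "\n".join(f"Q: {q}\nA: {llm(q)}" for q in questions)

    # Step 4: produce a final answer consistent with the evidence
    return llm(
        f"Question: {query}\nDraft answer: {baseline}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Write a corrected final answer consistent with the verifications."
    )
```

Answering each question in isolation in step 3 roughly corresponds to the paper's "factored" variant; the joint variant instead keeps the draft in context throughout.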

Let's go through an example to understand the process better.

Let's say we want to find the names of politicians born in New York City (NYC).

Step 1: Baseline response generation
Prompt the model to get the list of politicians born in NYC.

Step 2: Planning verifications
Generate a list of verification questions to verify the model's answer.

Step 3: Verification execution
Get answers for each of the verification questions.

Step 4: Final response generation
Use the answer from the previous step to refine the original answer.
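In miniature, step 4 amounts to keeping only those draft entries that the verification answers confirm. A toy illustration of that cross-check (the verified birthplaces are hard-coded strings here, not model output; in practice step 4 is itself an LLM call):

```python
# Cross-check each draft entry against its verified birthplace and keep
# only the entries confirmed as born in New York City.
verified = {
    "Hillary Clinton": "Chicago, Illinois",
    "Donald Trump": "Queens, New York City",
    "Michael Bloomberg": "Boston, Massachusetts",
}
draft = ["Hillary Clinton", "Donald Trump", "Michael Bloomberg"]

# Keep a name only if its verified birthplace mentions New York City
final = [name for name in draft if "New York City" in verified[name]]
print(final)  # only the NYC-born entry survives
```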

As expected, the final refined response from the LLM only consists of those individuals who were born in NYC, as revealed by the answers to the verification questions in step 3.

Please note that unlike the examples above, the original paper uses few-shot prompting to execute the entire process.
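The few-shot setup can be approximated by prepending a worked demonstration to each step's prompt. A hypothetical template for the verification-planning step (the demonstration pair below is invented for illustration, not taken from the paper):

```python
# Hypothetical few-shot template for the verification-planning step.
FEW_SHOT_PLAN = """\
Question: Name some mountains in Nepal.
Draft answer: Mount Everest, K2
Verification questions:
Where is Mount Everest located?
Where is K2 located?

Question: {query}
Draft answer: {draft}
Verification questions:
"""

prompt = FEW_SHOT_PLAN.format(
    query="Name some politicians who were born in NY, New York.",
    draft="Hillary Clinton, Donald Trump, Michael Bloomberg",
)
```

The demonstration shows the model the expected format (one short verification question per draft entry) before it sees the real query.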

What Are CoVe Prompting Results?

Experiments using CoVe for list-based and long-form generation show that:

  • CoVe performs significantly better than zero-shot, few-shot, and CoT. Hallucinations are reduced after employing CoVe.

Figure: Test precision and average number of positive and negative (hallucinated) entries2

  • CoVe improves performance on closed book QA. Employing CoVe improves the F1 score by 23% (from 0.39 to 0.48).

Figure: CoVe's performance against zero-shot and few-shot prompting on closed-book MultiSpanQA2

  • On long-form content generation, CoVe-based Llama outperforms InstructGPT, ChatGPT, and PerplexityAI.

Figure: CoVe against InstructGPT, ChatGPT, and PerplexityAI2

Limitations of CoVe Prompting

  • CoVe reduces the hallucinations in the generated response but doesn't remove them completely.
  • CoVe targets hallucinations in the form of directly stated factual inaccuracies. The paper doesn't gauge its effectiveness in reducing other forms of hallucination, such as incorrect reasoning steps.
  • The CoVe approach relies on the LLM to find its own inaccuracies. If the model cannot detect its own errors, it won't benefit from CoVe.

Conclusion

Hallucinations are common in LLM responses, especially when the generated response is a long passage comprising multiple sentences. Such hallucinations degrade the quality of the generated response. CoVe is a simple and effective technique to reduce the hallucinations from an LLM response without any training or fine-tuning. However, the paper doesn't study the effectiveness of CoVe in reducing hallucinations other than factual inaccuracies.

Footnotes

  1. Jason Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903

  2. Shehzaad Dhuliawala et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models. https://arxiv.org/abs/2309.11495

Copyright © 2024 Learn Prompting.