🟢 Chain-of-Verification (CoVe) Prompting

🟢 This article is rated easy

Last updated on September 27, 2024

Takeaways

Reduces hallucinations: CoVe reduces factual errors in Large Language Model (LLM) responses by having the model answer verification questions about its own draft.
Improved accuracy: CoVe outperforms Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) methods in generating accurate responses.
Four-step process: It involves generating a baseline response, planning verifications, answering the verification questions, and refining the final output.
Limitations: CoVe reduces but doesn't fully eliminate hallucinations, and its effect on incorrect reasoning steps has not been evaluated.

What is CoVe Prompting?

Large Language Models (LLMs) often generate responses that look plausible but are incorrect. This is known as hallucination, and it remains an unresolved problem. Hallucinations are more common for lesser-known facts, or facts that appear less frequently in the training corpus, and in long-form text spanning multiple sentences or paragraphs. For reasoning tasks, Chain-of-Thought (CoT) prompting improves response quality by encouraging the model to write out its reasoning before producing the final answer. Chain-of-Verification (CoVe) takes a similar approach, using verification questions to improve response quality.

Chain-of-Verification (CoVe) prompting is a self-critique technique that improves on the LLM's initial response. The model first plans verification questions to check its own work, then systematically answers those questions, and finally produces a refined response. The verification questions are usually shorter and easier to answer than the original query. The questions and their answers are given back to the LLM as feedback for revising the initial answer.

Here is an example employing CoVe to refine the original answer:

Query

Prompt

Name some politicians who were born in NY, New York.

Baseline Response

AI Output

Here are some politicians who were born in NY, New York:
Hillary Clinton - former secretary of state and former Democratic presidential nominee
Donald Trump - former president of the United States
Michael Bloomberg - former Mayor of New York City and former Democratic presidential candidate
...

Plan verifications

Prompt

Where was Hillary Clinton born?
Where was Donald Trump born?
Where was Michael Bloomberg born?
...

Execute verifications

AI Output

Hillary Clinton was born in Chicago, Illinois, United States on October 26, 1947.
Donald Trump was born on June 14, 1946, in Queens, New York City, New York, United States.
Michael Bloomberg was born on February 14, 1942, in Boston, Massachusetts, United States.
...

Final verified response

AI Output

Here are some politicians who were born in NY, New York:
Donald Trump - former president of the United States
Alexandria Ocasio-Cortez - Democratic member of the U.S. House of Representatives
...

As the example above shows, the model answers the individual verification questions more accurately than it stated the same facts in its original long-form answer.

How to Use CoVe Prompting?

Answer generation using CoVe is a 4-step process:

1. Baseline response generation: You prompt the model to generate a draft answer. This draft may not be factually correct.
2. Planning verifications: You generate a list of verification questions from the query and the initial response.
3. Verification execution: You prompt the LLM to answer each of the verification questions.
4. Final response generation: You incorporate the answers from the previous step and generate a final response.
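The four steps can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: `llm` is a hypothetical callable that takes a prompt string and returns the model's text, and the prompt wording is an assumption made up for this sketch.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt to the model's text.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> str:
    # Step 1: baseline response (may contain factual errors).
    baseline = llm(f"Answer the question.\n\nQuestion: {query}")

    # Step 2: plan verification questions from the query and the draft.
    plan = llm(
        "Write one verification question per line that checks the facts "
        f"in the draft answer.\n\nQuestion: {query}\nDraft answer: {baseline}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # Step 3: answer each verification question on its own.
    answers = [llm(f"Answer briefly.\n\nQuestion: {q}") for q in questions]

    # Step 4: revise the draft using the question/answer pairs as feedback.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        "Revise the draft answer so that it is consistent with the "
        f"verification results.\n\nQuestion: {query}\n"
        f"Draft answer: {baseline}\nVerification results:\n{evidence}"
    )
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider: you can wrap whichever client you use in a function with this signature.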

Let's go through an example to understand the process better.

Let's say we want to find the names of politicians born in New York City (NYC).

Step 1: Baseline response generation
Prompt the model to get the list of politicians born in NYC.

Step 2: Planning verifications
Generate a list of verification questions to verify the model's answer.

Step 3: Verification execution
Get answers for each of the verification questions.

Step 4: Final response generation
Use the answers from the previous step to refine the original answer.

As expected, the final refined response from the LLM consists only of individuals who were born in NYC, as confirmed by the answers to the verification questions in step 3.
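To make the cross-check in step 4 concrete, here is a deterministic sketch using the three names and birthplaces quoted in the earlier example. In CoVe itself the LLM performs this revision from the verification answers; the `keep_consistent` helper and its substring check are assumptions introduced purely for illustration.

```python
from typing import Dict, List

# Names from the baseline response and birthplaces from the verification
# answers quoted in the example earlier in this article.
baseline_names: List[str] = ["Hillary Clinton", "Donald Trump", "Michael Bloomberg"]
verified_birthplaces: Dict[str, str] = {
    "Hillary Clinton": "Chicago, Illinois",
    "Donald Trump": "Queens, New York City, New York",
    "Michael Bloomberg": "Boston, Massachusetts",
}


def keep_consistent(
    names: List[str], birthplaces: Dict[str, str], required: str = "New York City"
) -> List[str]:
    """Keep only the names whose verified birthplace mentions `required`."""
    return [name for name in names if required in birthplaces.get(name, "")]


print(keep_consistent(baseline_names, verified_birthplaces))
# prints ['Donald Trump']
```

A real pipeline would not rely on string matching. The sketch only shows why the two names contradicted by their verification answers drop out of the final list.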

Please note that unlike the examples above, the original paper uses Few-Shot Prompting to execute the entire process.
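A Few-Shot version of the planning step could be assembled along the following lines. This is a sketch under assumptions: the `build_planning_prompt` helper, the `Demo` layout, and the prompt format are hypothetical and are not taken from the paper. The single demonstration reuses the example from this article.

```python
from typing import List, Tuple

# Each demonstration: (query, baseline response, verification questions).
Demo = Tuple[str, str, List[str]]


def build_planning_prompt(demos: List[Demo], query: str, baseline: str) -> str:
    parts = []
    for d_query, d_baseline, d_questions in demos:
        question_lines = "\n".join(f"- {q}" for q in d_questions)
        parts.append(
            f"Question: {d_query}\nDraft answer: {d_baseline}\n"
            f"Verification questions:\n{question_lines}"
        )
    # The new case ends with an open slot for the model to complete.
    parts.append(
        f"Question: {query}\nDraft answer: {baseline}\nVerification questions:\n-"
    )
    return "\n\n".join(parts)


demos: List[Demo] = [
    (
        "Name some politicians who were born in NY, New York.",
        "Hillary Clinton, Donald Trump, Michael Bloomberg",
        [
            "Where was Hillary Clinton born?",
            "Where was Donald Trump born?",
            "Where was Michael Bloomberg born?",
        ],
    )
]
```

Calling `build_planning_prompt(demos, new_query, new_baseline)` returns a prompt in which the model sees one worked case and is left to continue the question list for the new one.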

What Are CoVe Prompting Results?

Experiments using CoVe for list-based and long-form generation show that:

CoVe performs significantly better than Zero-Shot, Few-Shot, and CoT. Hallucinations are reduced after employing CoVe.

Test Precision and average number of positive and negative (hallucination)

CoVe improves performance on closed-book QA. Employing CoVe raises the F1 score by 23% in relative terms (from 0.39 to 0.48).
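The 23% figure is a relative gain over the starting score, not a gain of 23 F1 points. A quick check of the arithmetic, using only the two scores quoted above (the helper name is made up for this sketch):

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional change from `before` to `after`."""
    return (after - before) / before


gain = relative_improvement(0.39, 0.48)
print(f"{gain:.1%}")  # prints 23.1%
```

The absolute difference is 0.09 F1, which is about 23% of the starting score of 0.39.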

CoVe's performance against Zero-Shot and Few-Shot in closed-book MultiSpanQA

On long-form content generation, CoVe-based Llama outperforms InstructGPT, ChatGPT and PerplexityAI.

CoVe against InstructGPT, ChatGPT and PerplexityAI

Limitations of CoVe Prompting

CoVe reduces the hallucinations in the generated response but doesn't remove them completely.
The paper addresses only hallucinations in the form of directly stated factual inaccuracies. It doesn't measure how effective CoVe is at reducing other forms of hallucination, such as incorrect reasoning steps.
The CoVe approach relies on the LLM to find its own inaccuracies. If the model cannot find them, it won't benefit from CoVe.

Conclusion

Hallucinations are common in LLM responses, especially when the generated response is a long passage comprising multiple sentences. Such hallucinations degrade the quality of the generated response. CoVe is a simple and effective technique for reducing hallucinations in LLM responses without any training or fine-tuning. However, the paper doesn't study how effective CoVe is at reducing hallucinations other than factual inaccuracies.

Bhuwan Bhatt

Bhuwan Bhatt, a Machine Learning Engineer with over 5 years of industry experience, is passionate about solving complex challenges at the intersection of machine learning and Python programming. Bhuwan has contributed his expertise to leading companies, driving innovation in AI/ML projects. Beyond his professional endeavors, Bhuwan is deeply committed to sharing his knowledge and experiences with others in the field. He firmly believes in continuous improvement, striving to grow by 1% each day in both his technical skills and personal development.

Footnotes

Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Dhuliawala, S., et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models. https://arxiv.org/abs/2309.11495