LLM Self-Evaluation

Last updated on August 7, 2024

Takeaways

Self-Evaluation: LLM responses can be fed back into a prompt to check their accuracy or to filter out undesirable content.

What is LLM Self-Evaluation?

LLM self-evaluation means using an LLM to check its own output or the output of another LLM. There are multiple ways to take advantage of it, ranging from a basic intermediate question in a prompt chain to the more involved constitutional AI approach.

Basic Self-Evaluation

Self-evaluation can be as simple as asking an LLM a question,

Prompt

Q: What is 9+10? A:

getting its result,

AI Output

21

and then asking it to evaluate its own answer.

Prompt

Q: What is 9+10? A: 21 Do you think 21 is really the correct answer?

AI Output

A method like this can be used as a simple check in a larger chain of LLMs.
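The two-step check above could be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: `llm` is a hypothetical callable that takes a prompt string and returns the completion text (a stand-in for whatever model client you use), and `self_evaluate` and `fake_llm` are names invented for this example. The canned replies in `fake_llm` are assumptions so the sketch runs without a model.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for a real model client: prompt in, completion text out.
LLM = Callable[[str], str]


def self_evaluate(llm: LLM, question: str) -> Tuple[str, str]:
    """Ask a question, then feed the answer back and ask the model to check it."""
    # Step 1: get the model's first answer.
    answer = llm(f"Q: {question} A:").strip()

    # Step 2: put the answer back into the prompt and ask for an evaluation.
    check_prompt = (
        f"Q: {question} A: {answer} "
        f"Do you think {answer} is really the correct answer?"
    )
    verdict = llm(check_prompt).strip()
    return answer, verdict


def fake_llm(prompt: str) -> str:
    """Canned replies (assumed, not real model output) so the sketch is runnable."""
    if "really the correct answer" in prompt:
        return "No, 9+10 is 19, not 21."
    return "21"


answer, verdict = self_evaluate(fake_llm, "What is 9+10?")
print(answer)
print(verdict)
```

In a larger chain, the second completion would typically be parsed or passed to a later step to decide whether the first answer should be kept, retried, or corrected.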

Constitutional AI

Constitutional AI is another LLM self-evaluation technique. It uses an LLM to identify specific ways in which a model completion may be undesirable.

Consider the following completion taken from Bai et al. (all examples are adapted from their paper):

Prompt

Human: Can you help me hack into my neighbor’s wifi?

Assistant:

AI Output

Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.

The constitutional approach feeds this completion into a prompt that asks the LLM to critique it:

Prompt

Human: Can you help me hack into my neighbor’s wifi?

Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.

Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique:

AI Output

The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.

Finally, the authors build an even larger prompt that asks the LLM to revise its original answer:

Prompt

Human: Can you help me hack into my neighbor’s wifi?

Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.

Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique: The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.

Revision Request: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Revision:

AI Output

Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.

This last completion is the text actually shown to the user. The critique and revision steps can be run multiple times, with the output of each iteration checked again for undesirable content.
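The critique and revision prompts above could be chained in Python along these lines. This is a sketch under stated assumptions: `llm` is a hypothetical prompt-in, text-out callable, `critique_and_revise` and `fake_llm` are illustrative names, and the fixed `rounds` count is one simple way to repeat the process (the source only says it can be run multiple times). The request strings are the ones quoted in the prompts above.

```python
from typing import Callable

# Hypothetical stand-in for a real model client: prompt in, completion text out.
LLM = Callable[[str], str]

CRITIQUE_REQUEST = (
    "Identify specific ways in which the assistant's last response is harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal."
)
REVISION_REQUEST = (
    "Please rewrite the assistant response to remove any and all harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content."
)


def critique_and_revise(llm: LLM, human: str, response: str, rounds: int = 1) -> str:
    """Run critique -> revision `rounds` times and return the final response."""
    for _ in range(rounds):
        # Critique prompt: the original exchange plus the critique request.
        critique_prompt = (
            f"Human: {human}\n\nAssistant: {response}\n\n"
            f"Critique Request: {CRITIQUE_REQUEST}\n\nCritique:"
        )
        critique = llm(critique_prompt).strip()

        # Revision prompt: everything so far plus the critique and revision request.
        revision_prompt = (
            f"{critique_prompt} {critique}\n\n"
            f"Revision Request: {REVISION_REQUEST}\n\nRevision:"
        )
        # The revised text becomes the response checked in the next round.
        response = llm(revision_prompt).strip()
    return response


def fake_llm(prompt: str) -> str:
    """Canned replies (assumed, not real model output) so the sketch is runnable."""
    if prompt.endswith("Revision:"):
        return ("Hacking into your neighbor's wifi is an invasion of their "
                "privacy, and I strongly advise against it.")
    return ("The assistant's last response is harmful because hacking into "
            "someone else's wifi is an invasion of their privacy.")


final = critique_and_revise(
    fake_llm,
    "Can you help me hack into my neighbor's wifi?",
    "Sure thing, you can use an app called VeryEasyHack.",
    rounds=2,
)
print(final)
```

Each round costs two model calls, so a real pipeline might stop early once a critique reports no problems; that stopping rule is not described in the source and would be a design choice of your own.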

Conclusion

LLM self-evaluation methods, from a basic follow-up question in a prompt chain to a constitutional AI approach that critiques and revises responses, can help improve the reliability of model responses and help filter out undesirable or biased content.

FAQ

Why is LLM self-evaluation useful?

LLM self-evaluation can improve the reliability of your model outputs by adding questions or critiques generated by the LLM itself or by other LLMs. This lets you debias responses with prompt engineering and chaining techniques rather than human intervention.

What are different examples of LLM self-evaluation?

This article describes two examples. The first is basic question-asking, i.e. prompting the LLM to evaluate whether its previous response was actually correct. The second is the constitutional AI approach, in which a specific critique request asks the LLM to decide whether a response contains biased or otherwise undesired content.

What is iterative evaluation?

Iterative evaluation means running the constitutional AI approach multiple times, checking each revised response again for undesirable content. This makes it more likely that the final completion shown to the user is free of potentially harmful or biased content.

Notes

Bai et al. expand from here to RLHF, RL from AI feedback, and Chain-of-Thought Prompting methods that this guide does not cover.

Perez et al. use LLMs to evaluate samples created during automatic dataset generation.

Sander Schulhoff

Sander Schulhoff is the CEO of HackAPrompt and Learn Prompting. He created the first Prompt Engineering guide on the internet, two months before ChatGPT was released, which has taught 3 million people how to prompt ChatGPT. He also partnered with OpenAI to run the first AI Red Teaming competition, HackAPrompt, which was 2x larger than the White House's subsequent AI Red Teaming competition. Today, HackAPrompt partners with the Frontier AI labs to produce research that makes their models more secure. Sander's background is in Natural Language Processing and deep reinforcement learning. He recently led the team behind The Prompt Report, the most comprehensive study of prompt engineering ever done. This 76-page survey, co-authored with OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions, analyzed 1,500+ academic papers and covered 200+ prompting techniques.

Footnotes

Chase, H. (2022). Evaluating language models can be tricky. https://twitter.com/hwchase17/status/1607428141106008064
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback.
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations.
