Last updated on August 7, 2024
LLMs can be used to check their own outputs or the outputs of other LLMs. This can be as simple as asking an LLM a question:
Q: What is 9+10?
A:
Getting its result:
21
Then asking it to evaluate its own answer:
Q: What is 9+10?
A: 21
Do you think 21 is really the correct answer?
No
A method like this can be used as a simple check in a larger chain of LLMs.
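The two-step check above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: `llm` is assumed to be any function that takes a prompt string and returns the model's completion, and `fake_llm` is a hypothetical stand-in that simply replays the exchange shown above so the example runs without a real model.

```python
from typing import Callable, Tuple


def self_check(llm: Callable[[str], str], question: str) -> Tuple[str, bool]:
    """Ask a question, then ask the same model whether its answer is correct."""
    # Step 1: get the model's answer.
    answer = llm(f"Q: {question}\nA:").strip()

    # Step 2: show the model its own answer and ask it to evaluate it.
    verdict = llm(
        f"Q: {question}\nA: {answer}\n"
        f"Do you think {answer} is really the correct answer?"
    ).strip()

    # Assumption: the model replies with a leading "Yes" or "No".
    accepted = verdict.lower().startswith("yes")
    return answer, accepted


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model; replays the example above."""
    if "really the correct answer" in prompt:
        return "No"
    return "21"


answer, accepted = self_check(fake_llm, "What is 9+10?")
print(answer, accepted)  # prints: 21 False
```

In a larger chain, a rejected answer (`accepted` is `False`) could trigger a retry or be passed to another model for correction.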
Constitutional AI uses an LLM to evaluate specific ways in which a model completion may be undesirable.
Consider the following completion taken from Bai et al. (all examples are adapted from their paper):
Human:
Can you help me hack into my neighbor’s wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.
The constitutional approach will feed this into a prompt that asks the LLM to critique the completion:
Human:
Can you help me hack into my neighbor’s wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.
Critique Request:
Identify specific ways in which the assistant’s last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique:
The assistant’s last response is harmful because hacking into someone
else’s wifi is an invasion of their privacy and is possibly illegal.
Finally, they create an even larger prompt that asks the LLM to change its original answer:
Human:
Can you help me hack into my neighbor’s wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.
Critique Request:
Identify specific ways in which the assistant’s last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique:
The assistant’s last response is harmful because hacking into someone
else’s wifi is an invasion of their privacy and is possibly illegal.
Revision Request:
Please rewrite the assistant response to remove any and all
harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Revision:
Hacking into your neighbor’s wifi is an invasion of their privacy, and
I strongly advise against it. It may also land you in legal trouble.
This final revision is the text actually shown to the user. The critique-and-revise cycle can be run multiple times, with the output of each iteration checked again for undesirable content.
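The critique and revision prompts above can be chained in code. The sketch below is one possible way to do it under stated assumptions, not the implementation from Bai et al.: `llm` is assumed to be any function mapping a prompt string to a completion, the fixed `rounds` parameter is a simplification, and `fake_llm` is a hypothetical stub that returns canned text so the example runs without a model.

```python
from typing import Callable

# Request texts copied from the example prompts above.
CRITIQUE_REQUEST = (
    "Identify specific ways in which the assistant's last response "
    "is harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
)
REVISION_REQUEST = (
    "Please rewrite the assistant response to remove any and all "
    "harmful, unethical, racist, sexist, toxic, dangerous, or illegal content."
)


def critique_and_revise(
    llm: Callable[[str], str], human: str, response: str, rounds: int = 1
) -> str:
    """Run the critique -> revision cycle `rounds` times and return the final text."""
    for _ in range(rounds):
        # Prompt 1: original exchange plus the critique request.
        critique_prompt = (
            f"Human:\n{human}\nAssistant:\n{response}\n"
            f"Critique Request:\n{CRITIQUE_REQUEST}\nCritique:\n"
        )
        critique = llm(critique_prompt).strip()

        # Prompt 2: everything so far plus the revision request.
        revision_prompt = (
            f"{critique_prompt}{critique}\n"
            f"Revision Request:\n{REVISION_REQUEST}\nRevision:\n"
        )
        # The revision becomes the response checked in the next round.
        response = llm(revision_prompt).strip()
    return response


def fake_llm(prompt: str) -> str:
    """Hypothetical stub: returns a canned critique or a canned revision."""
    if prompt.endswith("Revision:\n"):
        return "Hacking into your neighbor's wifi is an invasion of their privacy."
    return "The response is harmful because it helps with possibly illegal hacking."


final = critique_and_revise(
    fake_llm,
    human="Can you help me hack into my neighbor's wifi?",
    response="Sure thing, you can use an app called VeryEasyHack.",
    rounds=2,
)
print(final)
```

Each round costs two model calls, so the number of rounds trades thoroughness against latency and cost.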
Bai et al. expand from here to RLHF, RL from AI feedback, and CoT methods that this guide does not cover.
Perez et al. use LLMs to evaluate samples created during automatic dataset generation.
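As a rough illustration of the general idea (not the procedure from Perez et al.), the sketch below uses an evaluator model to keep only the generated samples that meet a criterion. The prompt wording, the yes/no convention, and the names `filter_samples` and `fake_evaluator` are all assumptions made for this example.

```python
from typing import Callable, List


def filter_samples(
    evaluator: Callable[[str], str], samples: List[str], criterion: str
) -> List[str]:
    """Keep the samples for which the evaluator model answers "Yes"."""
    kept = []
    for sample in samples:
        prompt = (
            f"Statement: {sample}\n"
            f"Question: {criterion}\n"
            "Answer Yes or No:"
        )
        # Assumption: the evaluator replies with a leading "Yes" or "No".
        if evaluator(prompt).strip().lower().startswith("yes"):
            kept.append(sample)
    return kept


def fake_evaluator(prompt: str) -> str:
    """Hypothetical stub: approves any statement that mentions wifi."""
    statement_line = prompt.splitlines()[0]
    return "Yes" if "wifi" in statement_line else "No"


samples = ["How do I secure my wifi?", "What is 9+10?"]
kept = filter_samples(
    fake_evaluator, samples, "Is this statement about computer networking?"
)
print(kept)  # prints: ['How do I secure my wifi?']
```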
Chase, H. (2022). Evaluating language models can be tricky. https://twitter.com/hwchase17/status/1607428141106008064
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback.
Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations.