Prompt Engineering Guide

Calibrating LLMs

This article is rated medium
Reading Time: 5 minutes
Last updated on August 7, 2024

Sander Schulhoff

Takeaways
  • LLM Calibration: Calibrating LLMs involves adjusting their output distributions to reduce biases and ensure more balanced predictions.
  • Example: In sentiment analysis, bias can be measured using neutral inputs and countered by applying a transformation which corrects for the measured bias.

What is Calibrating LLMs?

Calibrating LLMs, or more specifically their output distributions, is another way of counteracting some of the biases they exhibit. Let's walk through a quick example: say we have a sentiment analysis task with two possible labels, Positive and Negative. Consider what happens when the LLM is prompted with "Input: nothing Sentiment:". This input doesn't contain any context which the LLM can use to make a sentiment prediction, so it is called a context-free input.

Since nothing is neither a positive nor a negative concept, we would expect the LLM to output a probability of about 0.5 for both Positive and Negative. However, often (and for this example) that will not be the case.

AI Output


p("Positive" | "Input: nothing Sentiment:") = 0.9

p("Negative" | "Input: nothing Sentiment:") = 0.1

Given these label probabilities for a context-free input, we know that the LLM's output distribution is likely biased towards the label Positive. This may cause the LLM to favor Positive for all inputs, even if the input is not actually positive.

If we can somehow calibrate the output distribution, such that context-free inputs are assigned a probability of 0.5 for both Positive and Negative, then we can often remove the bias towards Positive and the LLM will be more reliable on both context-free inputs and inputs with context.
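To make the measurement step concrete, here is a minimal sketch of how the bias on a context-free input might be quantified. It assumes you can read log-probabilities for the two label tokens from your model; the function name `label_probabilities`, the variable `bias_gap`, and the log-probability values are all made up for illustration.

```python
import math


def label_probabilities(label_logprobs):
    """Renormalize label log-probabilities so the labels alone sum to 1."""
    # Exponentiate each label's log-probability, then divide by the total
    # so that only the labels of interest share the probability mass.
    raw = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    total = sum(raw.values())
    return {label: value / total for label, value in raw.items()}


# Made-up log-probabilities for the context-free prompt
# "Input: nothing Sentiment:" (assumption: not real model output).
context_free_logprobs = {
    "Positive": math.log(0.45),
    "Negative": math.log(0.05),
}

probs = label_probabilities(context_free_logprobs)

# Distance between the most likely label and the unbiased value 1 / (number of labels).
bias_gap = max(probs.values()) - 1 / len(probs)

print(probs)
print(bias_gap)
```

A `bias_gap` near zero would suggest the model treats the context-free input evenly; a large gap, as with these made-up numbers, points to a bias toward one label.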

Non-Technical Solution

A non-technical solution to this problem is to simply provide Few-Shot examples where context-free exemplars are effectively assigned a probability of 0.5 for both Positive and Negative.

For example, we could provide the following Few-Shot examples which show each context-free exemplar being classified as both Positive and Negative:

Prompt


Input: I hate this movie.
Sentiment: Negative
Input: I love this movie.
Sentiment: Positive
Input: N/A
Sentiment: Positive
Input: N/A
Sentiment: Negative
Input: nothing
Sentiment: Positive
Input: nothing
Sentiment: Negative
Input: I like eggs.
Sentiment:

To my knowledge, this solution has not been explored in the literature, and I am not sure how well it works in practice. However, it is a simple solution that demonstrates what calibration is trying to achieve.
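A prompt like the one above could be assembled programmatically. The sketch below is only one possible way to do it: the helper name `build_balanced_prompt` and its parameters are hypothetical, and the one-field-per-line layout is an assumption about how the prompt is formatted.

```python
def build_balanced_prompt(labeled_examples, context_free_inputs, labels, query):
    """Build a Few-Shot prompt where each context-free input appears once per label."""
    blocks = []
    # Ordinary exemplars with their true labels.
    for text, label in labeled_examples:
        blocks.append(f"Input: {text}\nSentiment: {label}")
    # Each context-free input is paired with every label, so that across
    # the exemplars it is effectively labeled 50/50 (for two labels).
    for context_free in context_free_inputs:
        for label in labels:
            blocks.append(f"Input: {context_free}\nSentiment: {label}")
    # The query is left open for the model to complete.
    blocks.append(f"Input: {query}\nSentiment:")
    return "\n".join(blocks)


prompt = build_balanced_prompt(
    labeled_examples=[
        ("I hate this movie.", "Negative"),
        ("I love this movie.", "Positive"),
    ],
    context_free_inputs=["N/A", "nothing"],
    labels=["Positive", "Negative"],
    query="I like eggs.",
)
print(prompt)
```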

Technical Solution

Another solution for calibrating LLMs is contextual calibration, in which we fit special calibration parameters so that context-free inputs (like "Input: nothing Sentiment:") are assigned a probability of about 0.5 for both labels. In practice, this method calibrates over multiple context-free inputs (e.g. "Input: N/A Sentiment:", "Input: [MASK] Sentiment:"). It averages the calibration parameters that work best for each context-free input to find the best calibration parameters for the LLM.

Example of Contextual Calibration

Let's go through an example of computing the calibration parameters for one context-free input. Note that this example is not reproducible with GPT-3 because it can't be restricted to the labels Positive and Negative.

Consider again the above example where the LLM assigns the following probabilities to the labels for a context-free input:

AI Output


p("Positive" | "Input: nothing Sentiment:") = 0.9

p("Negative" | "Input: nothing Sentiment:") = 0.1

We want to find some probability distribution q such that:

q("Positive" | "Input: nothing Sentiment:") = 0.5

q("Negative" | "Input: nothing Sentiment:") = 0.5

We will do so by creating a linear transformation that adjusts (calibrates) the probabilities of $p$.

$$\hat q = \text{Softmax}(W\hat p + b)$$

This equation takes the original probabilities $\hat p$ and applies the weights $W$ and bias $b$ to them. The weights $W$ and bias $b$ are the calibration parameters, which, when applied to the context-free example's probabilities, will yield $\hat q$ = [0.5, 0.5].

Computing W and b

We need to somehow compute the weights $W$ and bias $b$. One way to do this is:

$$W = \text{diag}(\hat p)^{-1}$$

$$b = 0$$

Although the definition of $W$ may seem a bit strange at first, it simply takes the reciprocal of each value in $\hat p$. Multiplying $W$ by the original probabilities $\hat p$ therefore gives a vector of ones, which Softmax maps to the calibrated probabilities [0.5, 0.5].
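The reason this choice works for any context-free distribution, not just the one in this example, can be written out in one line. This is a sketch that assumes every entry of $\hat p$ is nonzero (so the inverse exists); the symbol $K$ for the number of labels is introduced here and is not part of the original notation.

```latex
W\hat p
= \text{diag}(\hat p)^{-1}\hat p
= \begin{bmatrix} \hat p_1 / \hat p_1 \\ \vdots \\ \hat p_K / \hat p_K \end{bmatrix}
= \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix},
\qquad
\text{Softmax}\left(\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}\right)_i
= \frac{e^{1}}{K e^{1}}
= \frac{1}{K}
```

With two labels, $K = 2$, so each label receives a calibrated probability of 0.5.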

Let's verify that this works for the example above:

$$\hat p = [0.9, 0.1]$$

$$W = \text{diag}(\hat p)^{-1} = \text{diag}([0.9, 0.1])^{-1} = \begin{bmatrix} 0.9 & 0 \\ 0 & 0.1 \end{bmatrix}^{-1} \approx \begin{bmatrix} 1.11 & 0 \\ 0 & 10 \end{bmatrix}$$

$$\hat q = \text{Softmax}(W\hat p + b) = \text{Softmax}\left(\begin{bmatrix} 1.11 & 0 \\ 0 & 10 \end{bmatrix} [0.9, 0.1] + 0\right) \approx \text{Softmax}([1, 1]) = [0.5, 0.5]$$
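The same check can be run numerically. The following is a minimal sketch in plain Python, not a reference implementation: the helper names `softmax` and `calibrate` are made up, $W$ is stored as its diagonal (valid here because $W$ is diagonal), and the probabilities for the input with context are invented for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate(p_hat, w_diag, b):
    """Compute Softmax(W p_hat + b) for a diagonal W given as its diagonal."""
    # With a diagonal W, the matrix-vector product is an element-wise product.
    return softmax([w * p + bias for w, p, bias in zip(w_diag, p_hat, b)])


# Context-free probabilities from the example: [Positive, Negative].
p_context_free = [0.9, 0.1]

# W = diag(p_hat)^-1 and b = 0.
w_diag = [1 / p for p in p_context_free]
b = [0.0, 0.0]

q_context_free = calibrate(p_context_free, w_diag, b)
print(q_context_free)

# Made-up probabilities for an input that does carry context (assumption).
p_input = [0.7, 0.3]
q_input = calibrate(p_input, w_diag, b)
print(q_input)
```

With these made-up numbers, the input that the uncalibrated model labels Positive ends up leaning Negative after calibration, because 0.7 is lower than the 0.9 the model assigns to Positive with no context at all.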

As mentioned above, we would perform this same process for multiple different context-free inputs, and average the calibration parameters that work best for each context-free input to find the best calibration parameters for the LLM. This means that the final calibration parameters will probably not map any of the context-free inputs to exactly [0.5, 0.5].
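The averaging step could look roughly like the sketch below. It follows the description in this article (average the per-input $W$, keep $b = 0$); the probabilities for the three context-free inputs are made up, and the variable names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up label probabilities [Positive, Negative] for three
# context-free inputs (assumption: not real model output).
context_free_probs = {
    "Input: nothing Sentiment:": [0.9, 0.1],
    "Input: N/A Sentiment:": [0.8, 0.2],
    "Input: [MASK] Sentiment:": [0.7, 0.3],
}

# W = diag(p_hat)^-1 for each context-free input, stored as its diagonal.
per_input_w = [[1 / p for p in probs] for probs in context_free_probs.values()]

# Average the diagonals element-wise to get the final calibration weights.
num_inputs = len(per_input_w)
num_labels = 2
avg_w = [sum(w[i] for w in per_input_w) / num_inputs for i in range(num_labels)]

# Apply the averaged weights (with b = 0) back to each context-free input.
calibrated = {
    prompt: softmax([w * p for w, p in zip(avg_w, probs)])
    for prompt, probs in context_free_probs.items()
}

for prompt, q in calibrated.items():
    print(prompt, q)
```

Because the weights are averaged, none of the three context-free inputs lands exactly on [0.5, 0.5], which matches the caveat above.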

Another method

$b$ could also be set to $-\hat p$, and $W$ to the identity matrix. This method performs better on generation rather than classification tasks.
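As a quick sanity check of this alternative, the sketch below applies an identity $W$ and $b = -\hat p$ to the context-free probabilities from the earlier example. The helper name `softmax` and the probabilities for the second input are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Context-free probabilities: [Positive, Negative].
p_context_free = [0.9, 0.1]

# b = -p_hat; W is the identity, so W p_hat is just p_hat.
b = [-p for p in p_context_free]

# Softmax(I p_hat + b) on the context-free input itself: Softmax([0, 0]).
q_context_free = softmax([p + bias for p, bias in zip(p_context_free, b)])
print(q_context_free)

# Made-up probabilities for an input with context (assumption).
p_input = [0.7, 0.3]
q_input = softmax([p + bias for p, bias in zip(p_input, b)])
print(q_input)
```

Here the correction is additive rather than multiplicative, so it shifts the made-up input toward Negative more gently than the diagonal $W$ does.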

Conclusion

Model responses are often predisposed, or biased, towards certain labels. Calibrating LLMs can be used to counteract this bias.

FAQ

Why is calibrating LLMs necessary?

Calibrating LLMs is an important and more advanced step toward mitigating the biases that LLMs exhibit in their outputs. With the methods introduced in this article, we can achieve more balanced responses.

What are non-technical and technical solutions to calibrating an LLM output distribution?

The non-technical solution introduced in this article is to add Few-Shot examples to the prompt in which each context-free input appears once with every label. The more technical solution is contextual calibration, which computes calibration parameters that map context-free inputs to roughly equal probabilities for all labels.

Sander Schulhoff

Sander Schulhoff is the Founder of Learn Prompting and an ML Researcher at the University of Maryland. He created the first open-source Prompt Engineering guide, reaching 3M+ people and teaching them to use tools like ChatGPT. Sander also led a team behind Prompt Report, the most comprehensive study of prompting ever done, co-authored with researchers from the University of Maryland, OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions. This 76-page survey analyzed 1,500+ academic papers and covered 200+ prompting techniques.

Footnotes

  1. Zhao, T. Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate Before Use: Improving Few-Shot Performance of Language Models.