
**Calibrating LLMs**, or more specifically their **output distributions**^{1}, is another way of counteracting some of the biases they exhibit. Let's walk through a quick example: say we have a sentiment analysis task with two possible labels, `Positive` and `Negative`. Consider what happens when the LLM is prompted with `Input: nothing Sentiment: `. This input doesn't contain any *context* which the LLM can use to make a sentiment prediction, so it is called a **context-free** input.

Since `nothing` is neither a positive nor a negative concept, we would expect the LLM to output a probability of about 0.5 for both `Positive` and `Negative`. However, often (and for this example) that will not be the case.

```
p("Positive" | "Input: nothing Sentiment:") = 0.9
p("Negative" | "Input: nothing Sentiment:") = 0.1
```

Given these label probabilities for a context-free input, we know that the LLM's **output distribution** is likely biased towards the label `Positive`. This may cause the LLM to favor `Positive` for all inputs, even if the input is not actually positive.

If we can somehow **calibrate** the output distribution, such that context-free inputs are assigned a probability of 0.5 for both `Positive` and `Negative`, then we can often remove the bias towards `Positive` and the LLM will be more reliable on both context-free inputs and inputs with context.

A non-technical solution to this problem is to simply provide Few-Shot examples where context-free exemplars are effectively assigned a probability of 0.5 for both `Positive` and `Negative`.

For example, we could provide the following Few-Shot examples, which show each context-free exemplar being classified as both `Positive` and `Negative`:

```
Input: I hate this movie. Sentiment: Negative
Input: I love this movie. Sentiment: Positive
Input: N/A Sentiment: Positive
Input: N/A Sentiment: Negative
Input: nothing Sentiment: Positive
Input: nothing Sentiment: Negative
Input: I like eggs. Sentiment:
```

To my knowledge, this solution has not been explored in the literature, and I am not sure how well it works in practice. However, it is a simple solution that demonstrates what calibration is trying to achieve.
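As a hypothetical sketch (the variable names and the way the prompt is assembled are my own, not from any reference), such a balanced Few-Shot prompt could be built programmatically:

```python
# Hypothetical sketch: assemble a Few-Shot prompt in which every
# context-free exemplar appears once with each label, so the exemplars
# themselves do not push the model toward either label.
real_examples = [
    ("I hate this movie.", "Negative"),
    ("I love this movie.", "Positive"),
]
context_free_inputs = ["N/A", "nothing"]

lines = [f"Input: {text} Sentiment: {label}" for text, label in real_examples]
for cf in context_free_inputs:
    for label in ("Positive", "Negative"):
        lines.append(f"Input: {cf} Sentiment: {label}")
lines.append("Input: I like eggs. Sentiment:")  # the real query goes last

prompt = "\n".join(lines)
print(prompt)
```

The prompt string produced here matches the example above; only the final line (the real input) is left unlabeled for the LLM to complete.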

Another solution for calibrating LLMs is **contextual calibration**^{1}, where we adjust special calibration parameters that ensure context-free inputs (like `Input: nothing Sentiment: `) are assigned a probability of about 0.5 for both labels. Note that in practice this method performs calibration over multiple different context-free inputs (e.g. `Input: N/A Sentiment: `, `Input: [MASK] Sentiment: `). It averages the calibration parameters that work best for each context-free input to find the best calibration parameters for the LLM.

Let's go through an example of computing the calibration parameters for one context-free input. Note that this example is not reproducible with GPT-3 because it can't be restricted to the labels `Positive` and `Negative`.

Consider again the above example where the LLM assigns the following probabilities to the labels for a context-free input:

```
p("Positive" | "Input: nothing Sentiment:") = 0.9
p("Negative" | "Input: nothing Sentiment:") = 0.1
```

We want to find some probability distribution q such that:

```
q("Positive" | "Input: nothing Sentiment:") = 0.5
q("Negative" | "Input: nothing Sentiment:") = 0.5
```

We will do so by creating a linear transformation that adjusts (calibrates) the probabilities $\hat p$.

$\hat q = \text{Softmax}(W\hat p + b)$

This equation takes the original probabilities $\hat p$ and applies the weights $W$ and bias $b$ to them. The weights $W$ and bias $b$ are the calibration parameters, which, when applied to the context-free example's probabilities, will yield $\hat q = [0.5, 0.5]$.

We need to somehow compute the weights $W$ and bias $b$. One way to do this is:

$W = \text{diag}(\hat p)^{-1}$

$b = 0$

Although the definition of $W$ may seem a bit strange at first, it is just taking the inverse of each value in $\hat p$ to find a $W$ that will transform the original probabilities $\hat p$ into the calibrated probabilities [0.5, 0.5].

Let's verify that this works for the example above:

$\hat p = [0.9, 0.1]$

$W = \text{diag}(\hat p)^{-1} = \text{diag}([0.9, 0.1])^{-1} = \begin{bmatrix} 0.9 & 0 \\ 0 & 0.1 \end{bmatrix}^{-1} = \begin{bmatrix} 1.11 & 0 \\ 0 & 10 \end{bmatrix}$

$\hat q = \text{Softmax}(W\hat p + b) = \text{Softmax}\left(\begin{bmatrix} 1.11 & 0 \\ 0 & 10 \end{bmatrix} \begin{bmatrix} 0.9 \\ 0.1 \end{bmatrix} + 0\right) = \text{Softmax}([1, 1]) = [0.5, 0.5]$
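This check can be reproduced in a few lines of NumPy. The following is a minimal sketch; the function and variable names are mine, not from the paper:

```python
import numpy as np

def softmax(z):
    """Standard softmax over a vector of scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Label probabilities the LLM assigns to the context-free input,
# ordered [Positive, Negative].
p_hat = np.array([0.9, 0.1])

# Calibration parameters: W = diag(p_hat)^-1, b = 0.
W = np.diag(1.0 / p_hat)
b = np.zeros(2)

q_hat = softmax(W @ p_hat + b)  # W @ p_hat = [1, 1] -> softmax -> [0.5, 0.5]
print(q_hat)
```

At inference time, the same $W$ and $b$ would be applied to the label probabilities of real inputs before picking the highest-probability label.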

As mentioned above, we would perform this same process for multiple different context-free inputs, and average the calibration parameters that work best for each context-free input to find the best calibration parameters for the LLM. This means that the final calibration parameters will probably not map any of the context-free inputs to exactly [0.5, 0.5].
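A sketch of that averaging step, assuming per-input probabilities that are made up for illustration (not measured from any model):

```python
import numpy as np

# Hypothetical label probabilities ([Positive, Negative]) the LLM
# assigns to several context-free inputs; the values are illustrative.
context_free_probs = [
    np.array([0.90, 0.10]),  # "Input: nothing Sentiment: "
    np.array([0.80, 0.20]),  # "Input: N/A Sentiment: "
    np.array([0.85, 0.15]),  # "Input: [MASK] Sentiment: "
]

# One set of calibration weights per input (W = diag(p)^-1, b = 0),
# then average them elementwise to get the final parameters.
W_avg = np.mean([np.diag(1.0 / p) for p in context_free_probs], axis=0)
print(np.diag(W_avg))
```

Because `W_avg` is a compromise across inputs, none of the three context-free inputs maps to exactly [0.5, 0.5] after calibration, as noted above.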

Alternatively, $W$ could be set to the identity matrix and $b$ to $-\hat p$. This parameterization performs better on generation tasks than on classification tasks^{1}.

Our model responses are often predisposed, or biased, towards certain labels. Calibrating LLMs can be used to counteract this bias.

Calibrating LLMs is an important and more advanced step toward mitigating the biases that may appear in LLM outputs. With the methods introduced in this article, we can achieve more balanced responses.

A non-technical solution introduced in this article is to add balanced Few-Shot examples directly to the model's prompt. A more technical solution is contextual calibration, which computes the calibration parameters that work best across multiple context-free inputs.


Copyright © 2024 Learn Prompting.