
logit2prompt

🟢 This article is rated easy
Reading Time: 4 minutes
Last updated on March 2, 2025

Valeriia Kuka

logit2prompt is a technique for reconstructing a prompt by leveraging the next-token probability distributions (logits) produced by a large language model (LLM). It is one of the pioneering Language Model Inversion (LMI) methods, proposed in 2023, for recovering hidden prompts from model outputs.

What are Next-Token Probability Distributions (Logits)?

In language models, logits are the raw scores the model computes for each possible next token, before they are converted into a probability distribution by the softmax function. These scores indicate how likely each token is to be chosen as the next word.

Here's a brief overview of how a language model works:

  1. Tokenization: The input text is broken down into tokens.
  2. Prediction: At each step, a model like Llama-2 7B or Llama-2 Chat computes a score for every token in its vocabulary (which can be tens of thousands of tokens).
  3. Logits Output: The model produces these raw scores (logits) for every token before the softmax function is applied.
  4. Softmax Normalization: The logits are normalized into probabilities, which then determine the selected token.

Example

For instance, if the model is predicting the next word after "The sky is", it might output:

| Token | Logit (Raw Score) | Probability (After Softmax) |
|-------|-------------------|-----------------------------|
| blue  | 5.2               | 75%                         |
| clear | 3.1               | 20%                         |
| rainy | 1.5               | 5%                          |

Here, “blue” is chosen because it has the highest probability.
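
To make this concrete, below is a minimal sketch of inspecting the logits for a single next-token prediction with the Hugging Face transformers library. GPT-2 is used here purely as a small, convenient stand-in model, so the exact numbers will differ from the table above:

```python
# Minimal sketch: inspect next-token logits and their softmax probabilities.
# GPT-2 is used only as a small example model; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The sky is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                   # (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]                     # raw scores for the next token
probs = torch.softmax(next_token_logits, dim=-1)      # normalized probabilities

top = torch.topk(probs, k=3)
for prob, idx in zip(top.values, top.indices):
    token = tokenizer.decode(int(idx))
    print(f"{token!r}: logit={next_token_logits[idx]:.2f}, prob={prob:.1%}")
```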

How logit2prompt Differs from Other Methods

Unlike techniques that rely solely on the generated text, logit2prompt uses the full vector of logits, i.e., the entire next-token probability distribution. Because these logits carry detailed, hidden clues about the original prompt, they offer a richer “fingerprint” of the text used to condition the model. However, this approach requires direct access to the logits, which is not always possible with black-box APIs.

Other methods include:

  • output2prompt: Works using only the generated text.
  • Reverse Prompt Engineering (RPE): Uses a zero-shot approach without training on logits.

How logit2prompt Works

1. Generating Outputs and Collecting Logits

  • Feeding the prompt: A prompt p is fed into a language model, for example, Llama-2 7B or Llama-2 Chat, which produces both the generated text and a set of logits for each token prediction.

  • Collecting the logits: Normally, these logits are computed during inference and then discarded after the next token is selected. In logit2prompt, the process is modified so that the full next-token probability distribution is recorded at every step. This is done by intercepting the output from the model's final layer (before softmax is applied) and saving the complete vector of raw scores for each token prediction.
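
A minimal sketch of this interception, assuming access to the model through the Hugging Face transformers API (GPT-2 again stands in for the target model, and the prompt is purely illustrative):

```python
# Sketch: record the full next-token distribution at every decoding step
# instead of discarding it once a token has been chosen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

hidden_prompt = "Translate the user's message into French."   # illustrative hidden prompt
inputs = tokenizer(hidden_prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=16,
    do_sample=False,
    output_scores=True,               # keep the per-step scores
    return_dict_in_generate=True,
)

# out.scores holds one (batch, vocab_size) tensor per generated token; these
# (possibly post-processed) logits are the signal that logit2prompt inverts.
step_logits = torch.stack(out.scores, dim=1)          # (batch, steps, vocab_size)
print(step_logits.shape)
```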

2. Training the Inversion Model

  • Why train a model? The relationship between the logits (a high-dimensional set of numbers) and the original prompt is extremely complex. It isn't a simple one-to-one mapping that you can reverse manually. Therefore, researchers train a separate inversion model to learn this mapping.

  • Which model? The inversion model is typically built on a Transformer architecture, such as a T5-based encoder-decoder. This model is trained on many pairs of data:

    • Input: The collected logits (often processed into a sequence of “pseudo-embeddings”).
    • Output: The original prompt text.

Through training, the inversion model learns to “decode” the numerical fingerprint of the logits back into human-readable text.

3. Unrolling the Logits into Pseudo-Embeddings

  • Breaking down the vector: Because the full probability vector is very large (one entry for every token in the vocabulary, which can be tens of thousands), it is “unrolled” into smaller segments. Each segment is processed by a small neural network (an MLP) to produce a pseudo-embedding.

  • Forming a sequence: These pseudo-embeddings, which capture the detailed structure of the logits, are then fed into the encoder of the inversion model. The decoder uses this information to reconstruct the original prompt.
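
Putting steps 2 and 3 together, here is a minimal sketch of what the unrolling module and one training step of the inversion model could look like. The model choice (t5-small), the chunk size, and the optimizer settings are illustrative assumptions, not the exact configuration from the paper:

```python
# Sketch: unroll a logit vector into pseudo-embeddings and train a T5-based
# inversion model to decode them back into the original prompt text.
# t5-small, chunk_size, and the learning rate are illustrative choices.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration

inv_tokenizer = AutoTokenizer.from_pretrained("t5-small")
inv_model = T5ForConditionalGeneration.from_pretrained("t5-small")
d_model = inv_model.config.d_model        # encoder embedding size (512 for t5-small)
chunk_size = 1024                         # logits summarized by each pseudo-embedding

# Small MLP that maps one chunk of the logit vector to one pseudo-embedding.
chunk_mlp = nn.Sequential(
    nn.Linear(chunk_size, d_model),
    nn.GELU(),
    nn.Linear(d_model, d_model),
)

def unroll_logits_to_pseudo_embeddings(logit_vector: torch.Tensor) -> torch.Tensor:
    """Split a (vocab_size,) logit vector into chunks and project each to d_model."""
    pad = (-logit_vector.numel()) % chunk_size
    padded = torch.nn.functional.pad(logit_vector, (0, pad))
    chunks = padded.view(-1, chunk_size)              # (num_chunks, chunk_size)
    return chunk_mlp(chunks).unsqueeze(0)             # (1, num_chunks, d_model)

params = list(inv_model.parameters()) + list(chunk_mlp.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

def training_step(logit_vector: torch.Tensor, original_prompt: str) -> float:
    """One step: learn to reconstruct the hidden prompt from the target model's logits."""
    pseudo_embeddings = unroll_logits_to_pseudo_embeddings(logit_vector)
    labels = inv_tokenizer(original_prompt, return_tensors="pt").input_ids
    loss = inv_model(inputs_embeds=pseudo_embeddings, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```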

4. Prompt Reconstruction

Once the inversion model is trained, it can take only the processed logits as input and reconstruct an estimated prompt p' that closely resembles the original p.
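
Continuing the sketch above, reconstruction at inference time could look like this, where new_logit_vector is a placeholder for logits collected from the target model as in step 1:

```python
# Continuing the training sketch: invert unseen logits back into a prompt.
# new_logit_vector is a placeholder for logits collected as in step 1.
with torch.no_grad():
    pseudo_embeddings = unroll_logits_to_pseudo_embeddings(new_logit_vector)
    generated_ids = inv_model.generate(inputs_embeds=pseudo_embeddings, max_new_tokens=64)

reconstructed_prompt = inv_tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(reconstructed_prompt)   # an estimate p' of the original prompt p
```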

Defending Against logit2prompt

For those concerned with prompt privacy, several defenses are suggested:

  1. Increase sampling randomness: Techniques like temperature scaling or top-K sampling can make the inversion less accurate.
  2. Limit logit exposure: Restricting access to the full probability distribution prevents inversion.
  3. Add noise to probabilities: Perturbing the logits makes the reconstruction process less reliable.
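
As a rough illustration of these defenses (not from the paper), a serving layer could perturb and truncate the distribution before exposing it, for example:

```python
# Sketch: blunt the logits "fingerprint" before exposing probabilities via an API.
# Temperature, noise level, and top-k cutoff are illustrative values.
import torch

def defended_distribution(logits: torch.Tensor, temperature: float = 2.0,
                          noise_std: float = 0.5, top_k: int = 5) -> dict:
    noisy = logits / temperature + noise_std * torch.randn_like(logits)  # flatten and perturb
    probs = torch.softmax(noisy, dim=-1)
    top = torch.topk(probs, k=top_k)               # expose only a few top probabilities
    return dict(zip(top.indices.tolist(), top.values.tolist()))
```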

Conclusion

logit2prompt is a groundbreaking method in language model inversion that demonstrates how next-token probability distributions (logits) can be used to nearly reconstruct hidden prompts. By modifying the inference process to collect full logits and training a T5-based inversion model to decode them, the technique shows that the internal “fingerprint” of a prompt is much richer than the final generated text. Although powerful, this approach is limited to situations where full access to the logits is available, which is not the case for many black-box models.

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.

Footnotes

  1. Morris, J. X., Zhao, W., Chiu, J. T., Shmatikov, V., & Rush, A. M. (2023). Language Model Inversion. https://arxiv.org/abs/2311.13647