Interpretable Soft Prompts

Last updated on March 3, 2025

Takeaways

Soft prompts can be interpreted by projecting them onto vocabulary tokens, but the resulting text can be semantically irrelevant to the task the prompt solves.
The Waywardness Hypothesis says that for any given task and any unrelated discrete prompt, there exists a soft prompt that performs well on the task and projects to that discrete prompt.

Soft prompts are sequences of vectors that don't correspond to any actual tokens in the vocabulary, which makes them difficult to interpret. We can still attempt an interpretation by mapping each vector to the closest token in the vocabulary. Projected soft prompts, however, are often wayward: they can solve tasks well, yet get projected to arbitrary tokens in the vocabulary.
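The nearest-token projection described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular library: the three-token vocabulary, its 3-dimensional embeddings, and the helper names `euclidean` and `project` are all made up for the example, and Euclidean distance is only one possible choice of "closest".

```python
import math

# Toy embedding table: token -> vector (hypothetical values, illustration only).
VOCAB = {
    "solve": [0.9, 0.1, 0.0],
    "bus":   [0.1, 0.9, 0.1],
    "thing": [0.0, 0.2, 0.9],
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def project(soft_prompt, vocab=VOCAB):
    """Map each soft-prompt vector to the closest token by Euclidean distance."""
    return [
        min(vocab, key=lambda tok: euclidean(vec, vocab[tok]))
        for vec in soft_prompt
    ]

# A two-vector "soft prompt"; neither vector equals any token embedding.
soft_prompt = [[0.2, 0.8, 0.0], [0.1, 0.3, 0.7]]
tokens = project(soft_prompt)
print(tokens)  # ['bus', 'thing']
```

Neither vector in `soft_prompt` is an actual token embedding, yet each one is assigned a token anyway. That forced assignment is where the interpretation can drift away from what the prompt really does.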

For example, if we are training on math questions like GSM8K, we might start with the prompt "You are a mathematician. Solve this question:". If we perform prompt tuning on it and then project the result back into token space, we might be left with something nonsensical like "A bus is a bus. Do thing here:". Even so, the soft prompt that maps to this nonsensical text can often provide better performance on the task!

The Waywardness Hypothesis

Khashabi et al. propose the Waywardness Hypothesis: given a task, for any discrete target prompt, there exists a continuous prompt that projects to it while still performing well on the task.

This means that given 1000 different tasks, there exist 1000 different performant soft prompts (one for each task) which map to the same discrete prompt.
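To make the claim concrete, here is a toy sketch in which two different "tasks" are each tuned into a prompt vector that projects to the same token. Everything in it is an assumption made for illustration: the 2-D embeddings, the quadratic stand-in for a task loss, and the idea of adding a penalty `gamma * ||p - e_target||^2` that keeps the vector near the target token. It is not presented as the exact procedure used by Khashabi et al.

```python
# Hypothetical 2-D token embeddings (toy values, not from a real model).
VOCAB = {"bus": [0.0, 0.0], "math": [4.0, 0.0], "rank": [0.0, 4.0]}

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def project(vec):
    """Return the vocabulary token whose embedding is closest to vec."""
    return min(VOCAB, key=lambda tok: sq_dist(vec, VOCAB[tok]))

def tune(task_optimum, target_token, gamma=1.0, lr=0.1, steps=200):
    """Gradient descent on task_loss(p) + gamma * ||p - e_target||^2.

    task_loss(p) = ||p - task_optimum||^2 is a stand-in for a real model's loss.
    """
    target = VOCAB[target_token]
    p = list(target)  # start from the target token's embedding
    for _ in range(steps):
        grad = [
            2 * (pi - oi) + 2 * gamma * (pi - ti)
            for pi, oi, ti in zip(p, task_optimum, target)
        ]
        p = [pi - lr * gi for pi, gi in zip(p, grad)]
    return p

# Two toy "tasks", each with its own unconstrained optimum.
tasks = {"task_a": [3.0, 0.0], "task_b": [0.0, 3.0]}
results = {}
for name, optimum in tasks.items():
    p = tune(optimum, "bus")
    # (projected token, tuned task loss, task loss of the raw target embedding)
    results[name] = (project(p), sq_dist(p, optimum), sq_dist(VOCAB["bus"], optimum))
    print(name, results[name])
```

In this toy setup both tuned vectors still project to "bus", while their task loss falls from 9 to about 2.25. Two different tasks therefore share one discrete reading, which says nothing about either task.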

Interpretability Risks

Khashabi et al. use the Waywardness Hypothesis to highlight a number of risks that arise when interpreting soft prompts. In particular, a soft prompt can be projected to a discrete prompt that gives a misleading impression of its intent.

Consider a soft prompt for ranking resumes. When projected into token space, it might read "You hiring manager. Rank good resumes:". This seems decent, if a bit ungrammatical. However, the vector that projects to the token "good" might lie almost as close to the token "white", and there could be implicit bias in the prompt. Using a slightly different projection method, we could end up with "You hiring manager. Rank white resumes:". This is obviously quite different and could have significant implications.
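The sensitivity to the projection method can be shown with a small sketch. The embeddings below are hypothetical values chosen so that the two metrics disagree; they are not taken from any real model, and the helper names are made up for the example.

```python
import math

# Hypothetical 2-D embeddings chosen to make the two metrics disagree.
VOCAB = {"good": [1.0, 1.0], "white": [4.0, 0.5]}

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def project_euclidean(vec):
    """Closest token by smallest Euclidean distance."""
    return min(VOCAB, key=lambda tok: euclidean(vec, VOCAB[tok]))

def project_cosine(vec):
    """Closest token by largest cosine similarity."""
    return max(VOCAB, key=lambda tok: cosine_similarity(vec, VOCAB[tok]))

soft_vector = [2.0, 0.4]  # one vector of a toy soft prompt
by_distance = project_euclidean(soft_vector)
by_angle = project_cosine(soft_vector)
print(by_distance, by_angle)  # good white
```

The same vector reads as "good" under Euclidean distance and as "white" under cosine similarity, so a single projected prompt should not be treated as the definitive reading of a soft prompt.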

As with a regular discrete prompt, we should be extremely conscious of the biases that might be present in a soft prompt. We must be especially careful here, because soft prompts are more difficult to interpret.

Sander Schulhoff

Sander Schulhoff is the CEO of HackAPrompt and Learn Prompting. He created the first Prompt Engineering guide on the internet, two months before ChatGPT was released, which has taught 3 million people how to prompt ChatGPT. He also partnered with OpenAI to run the first AI Red Teaming competition, HackAPrompt, which was 2x larger than the White House's subsequent AI Red Teaming competition. Today, HackAPrompt partners with the Frontier AI labs to produce research that makes their models more secure. Sander's background is in Natural Language Processing and deep reinforcement learning. He recently led the team behind The Prompt Report, the most comprehensive study of prompt engineering ever done. This 76-page survey, co-authored with OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions, analyzed 1,500+ academic papers and covered 200+ prompting techniques.

Footnotes

Khashabi, D., Lyu, S., Min, S., Qin, L., Richardson, K., Welleck, S., Hajishirzi, H., Khot, T., Sabharwal, A., Singh, S., & Choi, Y. (2021). Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts.
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training Verifiers to Solve Math Word Problems.
