Prompt Injection VS Jailbreaking: What is the difference?
February 4th, 2024 by Sander Schulhoff
I had always thought of Prompt Injection as tricking a model into doing something bad and Jailbreaking as specifically tricking it into doing things against a company's TOS, and usually with a chatbot (Thus, Jailbreaking would be a subset of Prompt Injection). A recent conversation on X (formerly Twitter) completely changed my mind.
I will start with my new understandings of these topics, since that is likely what you are here for. Then I will go into the conversation that changed my mind, and finally how I arrived at my mistaken beliefs.
The Definitions
Prompt Injection and Jailbreaking do not mean the same thing. They relate to distinct issues with LLMs.
Prompt Injection
For Prompt Injection, the
For
Here is my definition of Prompt Injection:
Prompt Injection is the process of overriding original instructions in the prompt with special untrusted input. It is an architectural problem resulting from GenAI models not being able to understand the difference between original developer instructions and user input instructions.
In direct PI, the untrusted input is user input, while in indirect PI, this might come from website content or another "external source".
Jailbreaking
For Jailbreaking, the
An analogous
Here is my definition of Jailbreaking:
Jailbreaking is the process of getting a GenAI model to do or say unintended things through prompting. It is either an architectural problem or a training problem made possible by the fact that adversarial prompts are extremely difficult to prevent.
Althought both Prompt Injection and Jailbreaking have negative connotations, I leave out malicious intent from my definitions. In many cases, Prompt Injection and Jailbreaking can be used for good. For example, many enterprise-grade LLMs are naturally resistant to talking about sexuality. However, the MMLU benchmark1 contains a large amount of these questions. As such, it may be necessary for researchers to Jailbreak models in order to properly evaluate them on this benchmark.
The Conversation
Now onto how I came to change my understanding of these terms. This all started when Riley Goodside posted about a new invisible text-based Prompt Injection attack.
Pliny the Prompter questioned whether this was truly considered Prompt Injection.
Riley responded with a great explanation of Jailbreaking vs Prompt Injection.
Pliny responded, noting that they had been using a definition from Lakera.
Simon responded that he disagrees with the current Lakera definition. His comment is what really helped me understand the difference.
In a follow up conversation, Simon (correctly) notes that the HackAPrompt paper2 also gets it wrong.
HackAPrompt is a global AI security competition we ran over the last year (shameless academic plug, we won Best Theme Paper at EMNLP2023). It was sponsored by OpenAI, Preamble, HuggingFace, and 10 other AI companies, and reached thousands of people around the world.
Shoutout to Kai Greshake for this great graphic, which helped me visualize the difference.
How I Made This Mistake
When I was writing the HackAPrompt paper I couldn't find a good definition of these terms, so I made one up!
Ok, its a bit more nuanced than this. I read Simon's Blog3 and spoke with Riley Goodside and Preamble. I also read hundreds (truly hundreds) of papers and blog posts about Prompt Injection and Jailbreaking. I found academic papers that conflated them or defined them differently from other papers. I just couldn't find something clear, nor could I reach relevant parties who could explain them to me, so I used what seemed like community consensus. Anyways, I was wrong! This has been cool to learn about, and I am glad I was able to share it with you :)
Quick History of Prompt Injection
- Riley Goodside Discovered it4 and publicized it.
- Simon Willison coined the term3.
- Preamble also discovered it5. They were likely the first to discover it, but didn't publicize it at first.
- Kai Greshake discovered Indirection Prompt Injection6.
More Information
If you enjoyed this article and want to learn more about prompt hacking (both prompt injection and jailbreaking), checkout our intro to prompt hacking course and our advanced prompt hacking course.
Citation
Please cite this post as:
@article{Ignore2024Schulhoff,
Title = {Prompt Injection VS Jailbreaking: What is the difference?},
Author = {Sander V Schulhoff},
Year = {2024},
url={https://learnprompting.org/blog/2024/2/4/injection_jailbreaking}
}
Thank you to Kai Greshake and rez0 for feedback on this, and again to rez0 for finding a mistake in my description of the invisible text-based attack.
Footnotes
-
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020). Measuring Massive Multitask Language Understanding. ↩
-
Schulhoff, S. V., Pinto, J., Khan, A., Bouchard, L.-F., Si, C., Boyd-Graber, J. L., Anati, S., Tagliabue, V., Kost, A. L., & Carnahan, C. R. (2023). Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. Empirical Methods in Natural Language Processing. ↩
-
Willison, S. (2022). Prompt injection attacks against GPT-3. https://simonwillison.net/2022/Sep/12/prompt-injection/ ↩ ↩2
-
Goodside, R. (2022). Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions. https://twitter.com/goodside/status/1569128808308957185 ↩
-
Branch, H. J., Cefalu, J. R., McHugh, J., Hujer, L., Bahl, A., del Castillo Iglesias, D., Heichman, R., & Darwishi, R. (2022). Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples. ↩
-
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models. ↩