Last updated on August 7, 2024
Preventing prompt injection can be extremely difficult, and there exist few robust defenses against it. However, there are some common-sense solutions. For example, if your application does not need to output free-form text, do not allow such outputs. There are many different ways to defend a prompt. We will discuss some of the most common ones here.
This chapter covers additional commonsense strategies like filtering out words. It also covers prompt improvement strategies (instruction defense, post-prompting, different ways to enclose user input and XML tagging). Finally, we discuss using an LLM to evaluate output and some more model-specific approaches.
Crothers, E., Japkowicz, N., & Viktor, H. (2022). Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods. β©
Goodside, R. (2022). GPT-3 Prompt Injection Defenses. https://twitter.com/goodside/status/1578278974526222336?s=20&t=3UMZB7ntYhwAk3QLpKMAbw β©