๐๏ธ ๐ข Overview
Preventing prompt injection can be extremely difficult, and there exist few robust defenses against it(@crothers2022machine)(@goodside2021gpt). However, there are some commonsense
๐๏ธ ๐ข Filtering
Filtering is a common technique for preventing prompt hacking(@kang2023exploiting). There are a few types of filtering, but the basic idea is to check for words and phrase in the initial prompt or the output that should be blocked. You can use a blocklist or an allowlist for this purpose(@selvi2022exploring). A blocklist is a list of words and phrases that should be blocked, and an allowlist is a list of words and phrases that should be allowed.
๐๏ธ ๐ข Instruction Defense
You can add instructions to a prompt, which encourage the model to be careful about
๐๏ธ ๐ข Post-Prompting
The post-prompting defense(@christoph2022talking) simply puts
๐๏ธ ๐ข Random Sequence Enclosure
Yet another defense is enclosing the user input between two random sequences of characters(@armstrong2022using). Take this prompt as an example:
๐๏ธ ๐ข Sandwich Defense
The sandwich defense involves sandwiching user input between
๐๏ธ ๐ข XML Tagging
XML tagging can be a very robust defense when executed properly (in particular with the XML+escape). It involves surrounding user input by XML tags (e.g. ``). Take this prompt as an example:
๐๏ธ ๐ข Separate LLM Evaluation
A separate prompted LLM can be used to judge whether a prompt is adversarial.
๐๏ธ ๐ข Other Approaches
Although the previous approaches can be very robust, a few other approaches, such as using a different model, including fine tuning, soft prompting, and length restrictions, can also be effective.