Announcing our new Paper: The Prompt Report, with Co-authors from OpenAI & Microsoft!
Check it out →Filtering is a common technique for preventing prompt hacking1. There are a few types of filtering, but the basic idea is to check for words and phrase in the initial prompt or the output that should be blocked. You can use a blocklist or an allowlist for this purpose2.
A blocklist is a list of words and phrases that should be blocked from user prompts. For example, you can write some simple code to check for text in user input strings in order to prevent the input from including certain words or phrases related to sensitive topics such as race, gender discrimination, or self-harm.
An allowlist is a list of words and phrases that should be allowed in the user input. Similarly to blocklisting, you can write similar string-checking functions to only accept the words and phrases in the allowlist and block everything else.
Filtering through blocklists and allowlists is an effective method of moderation to ensure that your AI systems are not susceptible to hacks that expose biased or harmful content in the model's responses.
Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., & Hashimoto, T. (2023). Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. ↩
Selvi, J. (2022). Exploring Prompt Injection Attacks. https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/ ↩
Sign up and get the latest AI news, prompts, and tools.
Join 30,000+ readers from companies like OpenAI, Microsoft, Google, Meta and more!