Announcing our new Course: AI Red-Teaming and AI Safety Masterclass
Check it out →Filtering is a common technique for preventing prompt hacking1. There are a few types of filtering, but the basic idea is to check for words and phrase in the initial prompt or the output that should be blocked. You can use a blocklist or an allowlist for this purpose2. A blocklist is a list of words and phrases that should be blocked, and an allowlist is a list of words and phrases that should be allowed.
Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., & Hashimoto, T. (2023). Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. ↩
Selvi, J. (2022). Exploring Prompt Injection Attacks. https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/ ↩