🔓 Prompt Hacking🟢 Defensive Measures🟢 Filtering

Filtering

🟢 This article is rated easy

Reading Time: 1 minute

Last updated on August 7, 2024

Filtering is a common technique for preventing prompt hacking. There are a few types of filtering, but the basic idea is to check for words and phrases in the initial prompt or the output that should be blocked. You can use a blocklist or an allowlist for this purpose.

Tip

Interested in prompt hacking and AI safety? Test your skills on HackAPrompt, the largest AI safety hackathon. You can register here.

Note

Want to pursue a career in AI Red Teaming? Check out our AI Red Teaming Masterclass, available On-Demand or as a Live cohort. Earn your AIRTP+ certification.

Blocklist Filtering

A blocklist is a list of words and phrases that should be blocked from user prompts. For example, you can write some simple code to check for text in user input strings to prevent the input from including certain words or phrases related to sensitive topics such as race, gender discrimination, or self-harm.

Allowlist Filtering

An allowlist is a list of words and phrases that should be allowed in the user input. Similarly to blocklisting, you can write similar string-checking functions to only accept the words and phrases in the allowlist and block everything else.

Conclusion

Filtering through blocklists and allowlists is an effective method of moderation to ensure that your AI systems are not susceptible to hacks that expose biased or harmful content in the model's responses.

Sander Schulhoff

Sander Schulhoff is the CEO of HackAPrompt and Learn Prompting. He created the first Prompt Engineering guide on the internet, two months before ChatGPT was released, which has taught 3 million people how to prompt ChatGPT. He also partnered with OpenAI to run the first AI Red Teaming competition, HackAPrompt, which was 2x larger than the White House's subsequent AI Red Teaming competition. Today, HackAPrompt partners with the Frontier AI labs to produce research that makes their models more secure. Sander's background is in Natural Language Processing and deep reinforcement learning. He recently led the team behind The Prompt Report, the most comprehensive study of prompt engineering ever done. This 76-page survey, co-authored with OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions, analyzed 1,500+ academic papers and covered 200+ prompting techniques.

Footnotes

Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., & Hashimoto, T. (2023). Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. ↩
Selvi, J. (2022). Exploring Prompt Injection Attacks. https://research.nccgroup.com/2022/12/05/exploring-prompt-injection-attacks/ ↩

DIFFICULTY LEVEL

RECOMMENDED COURSES

ChatGPT for Everyone

Introduction to Prompt Engineering

AI Red-Teaming and AI Security Masterclass