Obfuscation/Token Smuggling
- Obfuscation techniques evade filters by disguising sensitive words or phrases through encoding, typos, or translation.
- Token smuggling uses creative ways to bypass content filters while preserving the meaning of restricted content.
- Common methods include Base64 encoding, character substitution, multilingual translation, and partial word completion.
- Defense strategies require multi-layered approaches including fuzzy matching and cross-language analysis.
What is Obfuscation/Token Smuggling?
Obfuscation is a technique that attempts to evade content filters by modifying how restricted words or phrases are presented. This can be done through encoding, character substitution, or strategic text manipulation.
Token smuggling refers to techniques that bypass content filters while preserving the underlying meaning. While similar to obfuscation, it often focuses on exploiting the way language models process and understand text.
Types of Obfuscation Attacks
1. Syntactic Transformation
Syntactic transformation attacks modify text while maintaining its interpretability:
Encoding Methods
- Base64 encoding
- ROT13 cipher
- Leet speak (e.g., "h4ck3r" for "hacker")
- Pig Latin
- Custom ciphers
Example: Base64 Encoding
Below is a demonstration of Base64 encoding to bypass filters:
2. Typo-based Obfuscation
Typo-based attacks use intentional misspellings that remain human-readable:
Common Techniques
- Vowel removal (e.g., "psswrd" for "password")
- Character substitution (e.g., "pa$$w0rd")
- Phonetic preservation (e.g., "fone" for "phone")
- Strategic misspellings (e.g., "haccer" for "hacker")
3. Translation-based Obfuscation
Translation attacks leverage language translation to bypass filters:
Methods
- Multi-step translation chains
- Low-resource language exploitation
- Mixed-language prompts
- Back-translation techniques
Example
English β Rare Language β Another Language β English, with each step potentially bypassing different filters.
Conclusion
Obfuscation and token smuggling represent sophisticated challenges in AI safety. While these techniques can bypass traditional filtering mechanisms, understanding their methods helps in developing more robust defenses. As language models continue to evolve, both attack and defense strategies will need to adapt accordingly.
Footnotes
-
Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., & Hashimoto, T. (2023). Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. β©
-
u/Nin_kat. (2023). New jailbreak based on virtual functions - smuggle illegal tokens to the backend. https://www.reddit.com/r/ChatGPT/comments/10urbdj/new_jailbreak_based_on_virtual_functions_smuggle β©
-
Rao, A., Vashistha, S., Naik, A., Aditya, S., & Choudhury, M. (2024). Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks. https://arxiv.org/abs/2305.14965 β©
-
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not what youβve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. https://arxiv.org/abs/2302.12173 β©
Sander Schulhoff
Sander Schulhoff is the Founder of Learn Prompting and an ML Researcher at the University of Maryland. He created the first open-source Prompt Engineering guide, reaching 3M+ people and teaching them to use tools like ChatGPT. Sander also led a team behind Prompt Report, the most comprehensive study of prompting ever done, co-authored with researchers from the University of Maryland, OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions. This 76-page survey analyzed 1,500+ academic papers and covered 200+ prompting techniques.