
Detecting AI-Generated Text

🟢 This article is rated easy
Reading Time: 3 minutes
Last updated on August 7, 2024

Sander Schulhoff

Takeaways
  • AI text detection is crucial for safety and education, with tools like OpenAI Text Classifier, watermarking, and DetectGPT being developed to address this challenge.
    • The OpenAI Text Classifier was simply trained on AI-generated text alongside similar human-written text, but it can be easily deceived.
    • Watermarking introduces statistical markers during text generation, allowing for detection by specific models, but requires implementation by the creators of the AI.
    • DetectGPT compares a text's log probability under the suspected model with that of randomly altered versions produced by another model, using the curvature of the log probability function to judge whether the text was machine-generated.

Detecting AI-generated text is a big problem for safety researchers and educators, among others. Tools like GPTZero, the GPT-2 detector, and bilingual detectors have seen significant success. However, they can be tricked.

OpenAI and other researchers are working to introduce statistical watermarking into their generated text, but this too may be fooled by modifying large portions of the text.

The problem of AI text detection will likely be an arms race as new models and new detection methods are introduced. Many companies have already started to build solutions that they claim are very effective, but it is difficult to prove this, especially as models change over time.

This article will cover some of the current methods for detecting AI-generated text, and the next will discuss a few ways people have found to fool them.

OpenAI Text Classifier

The OpenAI Text Classifier is a fairly good attempt at a general-purpose AI text detector. By training the model on a large quantity of AI-generated data and human-written text of similar quality, the detector can compute the likelihood that any given text was created by an LLM.

It has several limitations: it rejects any submission under 1,000 characters, the text can easily be edited to throw off its probability estimates, and because its training set skews toward professionally written text, it has more trouble with text written by children or by non-native English speakers.

It currently flags human-written text as AI-generated about 9% of the time, and correctly identifies AI-generated text only about 26% of the time. As models grow in power and scope those numbers may improve, but it may turn out that more specialized detectors are required to adequately assess whether a given text is generated.
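Those two figures are a true-positive rate and a false-positive rate. A minimal sketch of computing them for any binary detector (the `detector_rates` helper and the sample data are hypothetical, not part of OpenAI's tooling):

```python
def detector_rates(predictions, labels):
    """True-positive and false-positive rates for a binary AI-text detector.
    labels[i] is True when sample i is actually AI-generated;
    predictions[i] is True when the detector flags it as AI-generated."""
    true_pos = sum(p and l for p, l in zip(predictions, labels))
    false_pos = sum(p and not l for p, l in zip(predictions, labels))
    n_ai = sum(labels)
    n_human = len(labels) - n_ai
    return true_pos / n_ai, false_pos / n_human

# Toy evaluation set mirroring the rates quoted above:
# 26 of 100 AI samples flagged, 9 of 100 human samples flagged.
labels = [True] * 100 + [False] * 100
preds = [True] * 26 + [False] * 74 + [True] * 9 + [False] * 91
tpr, fpr = detector_rates(preds, labels)
```

Framing a detector this way makes the tradeoff explicit: lowering the flagging threshold raises the true-positive rate but also raises the rate of falsely accusing human authors.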

The Watermark Method

One method of detecting AI-generated text requires introducing a statistical watermark during generation. These techniques may use an LLM "whitelist": a way of determining whether text was generated by a specific AI model. The watermark works by selecting a randomized set of "green" tokens before each word is generated, and then softly promoting those tokens during sampling. This weighting has a minimal effect on the quality of generations but can be detected algorithmically by another LLM.

This is an intriguing idea, but it requires a model’s creators to implement this framework into their LLM. If a model doesn’t have the watermark built in, this method will not work.
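A toy sketch of the green-list idea described above, assuming a made-up 100-token vocabulary and greedy sampling in place of a real LLM (all names here — `VOCAB`, `DELTA`, the helper functions — are illustrative, not from any watermarking library):

```python
import hashlib
import random

# Hypothetical tiny vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = [f"tok{i}" for i in range(100)]
GREEN_FRACTION = 0.5  # share of the vocabulary marked "green" at each step
DELTA = 2.0           # soft boost added to green-token logits

def green_list(prev_token):
    """Pseudo-randomly partition the vocabulary, seeded on the previous
    token, so a detector can recompute the same green list later."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(GREEN_FRACTION * len(VOCAB))))

def watermarked_sample(prev_token, logits):
    """Softly promote green tokens by adding DELTA to their logits,
    then pick the highest-scoring token (greedy, for simplicity)."""
    greens = green_list(prev_token)
    return max(logits, key=lambda t: logits[t] + (DELTA if t in greens else 0.0))

def count_green(tokens):
    """Detection: count tokens that fall inside their step's green list.
    Watermarked text is far 'greener' than the ~50% expected by chance."""
    return sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))

# Generate 40 tokens with random stand-in logits at each step.
gen_rng = random.Random(1)
tokens = ["tok0"]
for _ in range(40):
    logits = {t: gen_rng.gauss(0.0, 1.0) for t in VOCAB}
    tokens.append(watermarked_sample(tokens[-1], logits))
```

The detector never needs the model's logits: it only re-derives each step's green list from the preceding token and counts hits, which is what makes the scheme checkable after the fact — but also why it only works if the generating model applied the boost in the first place.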

DetectGPT

The DetectGPT method can detect AI-generated text with less setup than the previous approaches. Researchers found that LLM text generations tend to "occupy negative curvature regions of the model's log probability function". Because of this, it is possible to build a curvature-based test for whether a block of text was machine-generated.

It works by computing log probabilities from the model that was thought to have generated the text and comparing them to random alterations of the text from another, pre-trained generic language model. In this way, DetectGPT can identify the likelihood of the passage being generated using probability curves alone!
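The comparison above can be sketched as a "perturbation discrepancy" score. This is a toy illustration only: a unigram model stands in for the scoring LLM, and a random word swap stands in for the mask-filling model (e.g. T5) that DetectGPT actually uses to perturb text:

```python
import math
import random

# Toy stand-in for an LLM: a unigram model where "the" is the likeliest word.
UNIGRAM = {"the": 0.4, "cat": 0.2, "sat": 0.2, "on": 0.1, "mat": 0.1}

def log_prob(tokens):
    """Log probability of a token sequence under the toy model."""
    return sum(math.log(UNIGRAM[t]) for t in tokens)

def perturb(tokens, rng=random.Random(0)):
    """Lightly alter the text: swap one random position for a random word."""
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.choice(list(UNIGRAM))
    return out

def detectgpt_score(tokens, n_perturbations=20):
    """Perturbation discrepancy: the text's log probability minus the average
    log probability of its perturbed variants. Model-generated text sits near
    a local maximum, so perturbing it drops the log probability and the
    score comes out positive; human text scores near zero or below."""
    perturbed = [log_prob(perturb(tokens)) for _ in range(n_perturbations)]
    return log_prob(tokens) - sum(perturbed) / n_perturbations

model_like = detectgpt_score(["the"] * 10)  # greedy output of the toy model
human_like = detectgpt_score(["mat"] * 10)  # an unlikely, "human" sequence
```

Because every swap away from the model's favorite token lowers the log probability, the model-like sequence scores positive, while the unlikely sequence does not — the curvature signal DetectGPT exploits, here in miniature.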

Note

See the next article for additional discussion of detectors and how people are tricking them.

Sander Schulhoff

Sander Schulhoff is the Founder of Learn Prompting and an ML Researcher at the University of Maryland. He created the first open-source Prompt Engineering guide, reaching 3M+ people and teaching them to use tools like ChatGPT. Sander also led the team behind The Prompt Report, the most comprehensive study of prompting ever done, co-authored with researchers from the University of Maryland, OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions. This 76-page survey analyzed 1,500+ academic papers and covered 200+ prompting techniques.

Footnotes

  1. Bansal, A., Chiang, P.-Y., Curry, M., Jain, R., Wigington, C., Manjunatha, V., Dickerson, J. P., & Goldstein, T. (2022). Certified Neural Network Watermarks with Randomized Smoothing.

  2. Gu, C., Huang, C., Zheng, X., Chang, K.-W., & Hsieh, C.-J. (2022). Watermarking Pre-trained Language Models with Backdooring.

  3. Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A Watermark for Large Language Models. https://arxiv.org/abs/2301.10226

  4. Mitchell, E., Lee, Y., Khazatsky, A., Manning, C., & Finn, C. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. https://doi.org/10.48550/arXiv.2301.11305