Detecting AI generated text is a big problem for safety researchers and educators, among others. Tools like GPTZero, GPT2 detector, and bilingual detectors have seen significant success, However, they can be tricked.
OpenAI and other researchers are working to introduce statistical watermarking into their generated text, but this too may be fooled by modifying large portions of the text.
The problem of AI text detection will likely be an arms race as new models and new detection methods are introduced. Many companies have already started to build solutions which they claim are very effective, but it is difficult to prove this, especially as models change over time.
This article will cover some of the current methods for detecting AI-generated text, and the next will discuss a few ways people have found to fool them.
The OpenAI Text Classifier is a fairly good attempt at a general-purpose AI text detector. By training the model on a large quantity of AI-generated data and human-written text of a similar quality, the detector is able to compute the likelihood that any given text was created by an LLM.
It has a number of limitations—it doesn’t accept any submission of under 1000 words, text can easily be edited to mess with the probability calculations, and because of its professionally-focused training set, it has more trouble with text created by children or non-english speakers.
It currently flags human text as AI-generated only about 9% of the time, and correctly identifies AI-generated text ~26% of the time. As the model increases in power and scope, those numbers will improve, but it may be the case that more specific detectors are required to adequately assess whether text is generated or not.
One method to detect AI generated text requires introducing a statistical watermark when generating the text. These techniques may use a LLM “whitelist”, which is a method of determining if text was generated by a specific AI model. The watermark works by selecting a randomized set of "green" tokens before a word is generated, and then softly promoting use of the selected tokens during sampling. These weighted values have a minimal effect on the quality of generations, but can be algorithmically detected by another LLM.
This is an intriguing idea, but it requires a model’s creators to implement this framework into their LLM. If a model doesn’t have the watermark built in, this method will not work.
The DetectGPT method is able to detect AI-generated text with less setup than the previous concepts. Researchers have found that LLM text generations tend to "occupy negative curvature regions of the model’s log probability function". Because of this, it is possible to create a curvature-based system for determining if a block of text was procedurally generated.
It works by computing log probabilities from the model that was thought to have generated the text and comparing them to random alterations of the text from another, pre-trained generic language model. In this way, DetectGPT is able to identify the likelihood of the passage being generated using probability curves alone!
For an additional discussion on the topic of detectors and how people are tricking them, see this article.
Sander Schulhoff is the Founder of Learn Prompting and an ML Researcher at the University of Maryland. He created the first open-source Prompt Engineering guide, reaching 3M+ people and teaching them to use tools like ChatGPT. Sander also led a team behind Prompt Report, the most comprehensive study of prompting ever done, co-authored with researchers from the University of Maryland, OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions. This 76-page survey analyzed 1,500+ academic papers and covered 200+ prompting techniques.
Bansal, A., yeh Ping-Chiang, Curry, M., Jain, R., Wigington, C., Manjunatha, V., Dickerson, J. P., & Goldstein, T. (2022). Certified Neural Network Watermarks with Randomized Smoothing. ↩
Gu, C., Huang, C., Zheng, X., Chang, K.-W., & Hsieh, C.-J. (2022). Watermarking Pre-trained Language Models with Backdooring. ↩
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A Watermark for Large Language Models. https://arxiv.org/abs/2301.10226 ↩
Mitchell, E., Lee, Y., Khazatsky, A., Manning, C., & Finn, C. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. https://doi.org/10.48550/arXiv.2301.11305 ↩