Detecting AI-generated text is a significant problem for safety researchers and educators, among others. Tools like GPTZero, the GPT-2 detector, and bilingual detectors have seen significant success; however, they can be tricked.
OpenAI and other researchers [1, 2] are working to introduce statistical watermarking into their generated text, but this too may be fooled by modifying large portions of the text.
The problem of AI text detection will likely be an arms race as new models and new detection methods are introduced. Many companies have already started to build solutions that they claim are very effective, but it is difficult to prove this, especially as models change over time.
This article will cover some of the current methods for detecting AI-generated text, and the next will discuss a few ways people have found to fool them.
The OpenAI Text Classifier is a fairly good attempt at a general-purpose AI text detector. By training the model on a large quantity of AI-generated data and human-written text of similar quality, the detector can compute the likelihood that any given text was created by an LLM.
It has several limitations: it doesn't accept submissions under 1,000 characters; text can easily be edited to skew the probability calculations; and, because of its professionally focused training set, it has more trouble with text written by children or by non-native English speakers.
It currently flags human-written text as AI-generated about 9% of the time (false positives), and correctly identifies AI-generated text only about 26% of the time (true positives). As the model grows in power and scope, those numbers should improve, but more specialized detectors may be required to adequately assess whether a given text is generated or not.
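The 9% and 26% figures above are a false-positive rate and a true-positive rate, respectively. The distinction matters when judging a detector, so here is a minimal sketch (with an illustrative helper name and a hypothetical 0.5 threshold, not OpenAI's actual evaluation code) of how those two rates are computed from a detector's per-document scores:

```python
def detection_rates(scores, labels, threshold=0.5):
    """Compute (false-positive rate, true-positive rate) for a detector.

    scores: the detector's AI-probability for each document (0..1)
    labels: True if the document really was AI-generated
    """
    flagged = [s >= threshold for s in scores]
    # Human-written documents that were wrongly flagged (~9% for the classifier)
    human_flags = [f for f, is_ai in zip(flagged, labels) if not is_ai]
    # AI-generated documents that were correctly flagged (~26%)
    ai_flags = [f for f, is_ai in zip(flagged, labels) if is_ai]
    fpr = sum(human_flags) / len(human_flags)
    tpr = sum(ai_flags) / len(ai_flags)
    return fpr, tpr
```

A low false-positive rate is usually the harder constraint in practice: wrongly accusing a student or author of using AI is costlier than missing some AI text.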
One method to detect AI-generated text requires introducing a statistical watermark when generating the text. These techniques can act as an LLM "whitelist": a way of determining whether text was generated by a specific AI model. The watermark works by selecting a randomized set of "green" tokens before a word is generated, and then softly promoting the use of the selected tokens during sampling. These weighted values have a minimal effect on the quality of generations but can be detected algorithmically [3].
This is an intriguing idea, but it requires a model's creators to build the framework into their LLM. If a model doesn't have the watermark built in, this method will not work.
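The green-token scheme described above can be sketched in a few lines. This is a toy illustration of the idea in Kirchenbauer et al. [3], not their implementation: the vocabulary size, green-list fraction, and logit bias below are arbitrary choices, and a real system would hash actual token ids from a model's tokenizer.

```python
import hashlib
import math
import random

VOCAB = list(range(1000))   # toy vocabulary of token ids (illustrative)
GREEN_FRACTION = 0.5        # fraction of the vocabulary marked "green" each step
BIAS = 2.0                  # soft logit boost given to green tokens

def green_tokens(prev_token):
    """Seed an RNG with the previous token and select that step's green list."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(GREEN_FRACTION * len(VOCAB))))

def watermark_logits(logits, prev_token):
    """Softly promote green tokens before sampling the next token."""
    greens = green_tokens(prev_token)
    return [x + BIAS if tok in greens else x for tok, x in enumerate(logits)]

def watermark_z_score(tokens):
    """Detect the watermark: count how often a token falls in the green list
    derived from its predecessor, and compare against chance with a z-test."""
    n = len(tokens) - 1
    hits = sum(tok in green_tokens(prev) for prev, tok in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / std
```

Because the green list is rederived from the text itself, detection needs only the hashing scheme, not the generating model's weights; watermarked text produces a large z-score, while unwatermarked text stays near zero.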
The DetectGPT method [4] can detect AI-generated text with less setup than the previous concepts. Researchers have found that LLM text generations tend to "occupy negative curvature regions of the model's log probability function". Because of this, it is possible to create a curvature-based system for determining whether a block of text was machine-generated.
It works by computing log probabilities under the model suspected of having generated the text, and comparing them to the log probabilities of random perturbations of that text produced by a separate, generic pre-trained language model. In this way, DetectGPT can estimate the likelihood that a passage was machine-generated using probability curvature alone!
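The core statistic is simple: compare the log probability of the text to the average log probability of its perturbations. Below is a minimal sketch of that "perturbation discrepancy" using stand-in functions; `toy_log_prob` replaces a real LLM's scoring pass and `perturb` replaces DetectGPT's T5-based mask filling, so only the comparison logic reflects the paper [4].

```python
import random

def toy_log_prob(tokens):
    """Stand-in for a real LLM's log p(x): peaks when every token equals 5,
    so [5, 5, ...] plays the role of 'model-generated' text here."""
    return -sum((t - 5) ** 2 for t in tokens)

def perturb(tokens, rng):
    """Stand-in for T5-style mask filling: resample one random token."""
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.randrange(11)
    return out

def detectgpt_score(tokens, log_prob, perturb_fn, n_perturbations=20, seed=0):
    """Perturbation discrepancy: d(x) = log p(x) - mean_i log p(x_perturbed_i).
    A large positive d suggests x sits at a local peak (negative curvature)
    of the model's log probability, i.e. the text is likely machine-generated."""
    rng = random.Random(seed)
    perturbed = [log_prob(perturb_fn(tokens, rng)) for _ in range(n_perturbations)]
    return log_prob(tokens) - sum(perturbed) / n_perturbations
```

With the toy scorer, text at the model's probability peak yields a positive discrepancy (every perturbation lowers its likelihood), while text far from the peak does not; the real method applies the same comparison with an actual LLM's log probabilities.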
The next article continues the discussion of these detectors and the ways people are tricking them.
Bansal, A., Chiang, P.-y., Curry, M., Jain, R., Wigington, C., Manjunatha, V., Dickerson, J. P., & Goldstein, T. (2022). Certified Neural Network Watermarks with Randomized Smoothing. ↩
Gu, C., Huang, C., Zheng, X., Chang, K.-W., & Hsieh, C.-J. (2022). Watermarking Pre-trained Language Models with Backdooring. ↩
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A Watermark for Large Language Models. https://arxiv.org/abs/2301.10226 ↩
Mitchell, E., Lee, Y., Khazatsky, A., Manning, C., & Finn, C. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. https://doi.org/10.48550/arXiv.2301.11305 ↩