Some AI Detection Tools Work Well, Others Fail, Says New Research
A new University of Chicago study finds that AI detection tools vary widely in quality and proposes a framework for using them both inside and outside of education.
Researchers at the University of Chicago recently put AI detection tools to the test, comparing hundreds of human-generated pieces of writing to AI-generated content.
For the study, the researchers systematically analyzed how different AI detection tools performed in terms of false negatives (incorrectly labeling AI-written text as human-written) and false positives (incorrectly labeling human-generated text as AI-written).
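To make those two error rates concrete, here is a minimal sketch in Python. The detector verdicts are invented for illustration; "positive" here means "flagged as AI-written."

```python
# Minimal sketch of the two error rates the study measures.
# Each pair is (true_source, detector_verdict); values are "ai" or "human".
results = [
    ("ai", "ai"),        # correctly flagged
    ("ai", "human"),     # false negative: AI text passed off as human
    ("human", "human"),  # correctly cleared
    ("human", "ai"),     # false positive: human text wrongly flagged
]

false_negatives = sum(1 for truth, verdict in results
                      if truth == "ai" and verdict == "human")
false_positives = sum(1 for truth, verdict in results
                      if truth == "human" and verdict == "ai")

ai_total = sum(1 for truth, _ in results if truth == "ai")
human_total = sum(1 for truth, _ in results if truth == "human")

print(f"False-negative rate: {false_negatives / ai_total:.0%}")
print(f"False-positive rate: {false_positives / human_total:.0%}")
```

On this toy data each rate comes out to 50%; the study's point is that a tool is only trustworthy when both rates stay near zero on real text.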
Brian Jabarian, a co-author of this AI detection study, is an economist at the University of Chicago who studies AI. He discusses the study’s findings with me and shares how AI detection tools can be used responsibly to protect academic integrity and, more broadly, guard society against the spread of AI slop.
AI Detection Study Results
Jabarian and his colleague, Alex Imas, tested a mix of AI detection tools: the commercial products Pangram, OriginalityAI, and GPTZero, along with RoBERTa, an open-source detector.
Pangram was the most successful, with what the researchers describe as “essentially zero false-positive rates and false-negative rates on medium-length to long passages,” and it remained accurate even when AI humanizers (tools that rewrite AI output to evade detection) were used. The open-source alternative fared far worse.
“Unfortunately, in this situation, the open-source tools are not great at all,” Jabarian says. RoBERTa had a false-positive rate ranging from 30% to 78%, far higher than most educators would accept.
But even some paid commercial tools performed poorly in the scenarios Jabarian and Imas tested. GPTZero had a lower false-positive rate than OriginalityAI, but OriginalityAI was better at distinguishing AI text from human text. GPTZero also struggled with humanizers, posting a false-negative rate of around 50% on humanized text.
Key Takeaways For Educators
In one sense, the research is a warning to educators against relying on many AI detection tools when making decisions about student grades or disciplinary action. But the finding that at least one tool performs very well suggests there may be a better way to use these resources.
Jabarian believes more transparency around AI detection tools is needed so educators can see what caused a piece of writing to be flagged and, ideally, where and how AI may have been used.
As a native French speaker, Jabarian says he uses AI-powered tools to help avoid grammar and spelling mistakes. “Should I be punished for that? Absolutely not,” he says. Ideally, AI detection tools could be used to help teachers guide students to appropriate use of AI when it's enhancing their writing rather than completing it for them.
A Framework For AI Detection
The paper also provides a framework for how educators and others can use AI detection tools, one that acknowledges that different situations have different detection needs.
As part of this framework, Jabarian and Imas argue for giving users more control over how AI detection tools are configured. For instance, if they had the option to adjust the sensitivity of an AI detector, educators would want to set a conservative false-positive rate, “as this would facilitate LLM-assisted writing (e.g., Grammarly)—since it would unlikely be flagged,” the co-authors write in the paper. That caution matters because, in an education setting, a false AI accusation can have significant consequences for students.
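As a rough illustration of that sensitivity control, the Python sketch below uses invented scores and a hypothetical flag_as_ai helper (not any vendor's actual API) to show how raising a detector's flagging threshold protects LLM-assisted human writing from false positives, at the cost of letting more AI text slip through.

```python
# Hedged sketch: most detectors output a score for how likely a text is
# AI-generated, and the flagging threshold sets the error trade-off.
# Scores below are invented for illustration.

def flag_as_ai(score: float, threshold: float) -> bool:
    """Flag a document when its AI-likelihood score meets the threshold."""
    return score >= threshold

scores = {
    "student essay, lightly Grammarly-edited": 0.35,
    "fully AI-generated essay": 0.92,
}

# A lower threshold catches more AI text but risks false accusations;
# a higher (conservative) threshold spares assisted human writing.
for threshold in (0.3, 0.9):
    print(f"threshold = {threshold}")
    for doc, score in scores.items():
        verdict = "flagged" if flag_as_ai(score, threshold) else "not flagged"
        print(f"  {doc}: {verdict}")
```

At the low threshold the lightly assisted essay is falsely flagged; at the conservative threshold only the fully AI-generated essay is caught, which matches the behavior the paper recommends for classroom settings. A platform filtering AI slop, as discussed below, might deliberately choose the lower threshold instead.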
Educators tend to think about AI detection tools primarily in terms of student work, but the tools matter beyond the classroom, Jabarian says. He notes that AI-generated slop is becoming a big problem on social media and in crowd-sourced reviews of restaurants and Amazon products. Effective AI detection can help weed that content out, and in these settings, Jabarian says, you'd ideally set a detection tool with a larger margin of error on false positives.
“I want to be sure you catch all the AI buzz,” he says. In this case, if the detection tool falsely flags some content as AI-generated, the stakes are lower. Ultimately, he believes that if AI detection tools are developed and deployed responsibly, they can help society do a better job of weeding out inappropriate AI use both inside and outside of the classroom.
Erik Ofgang is a Tech & Learning contributor. A journalist, author, and educator, his work has appeared in The New York Times, The Washington Post, Smithsonian, The Atlantic, and the Associated Press. He teaches in Western Connecticut State University's MFA program. While a staff writer at Connecticut Magazine, he won a Society of Professional Journalists award for his education reporting. He is interested in how humans learn and how technology can make that learning more effective.

