Scientists Compared ChatGPT Writing Assessments to Human Assessments. Here’s What They Found

A phone screen displaying "AI" above the OpenAI and Gemini apps.
(Image credit: Photo by Solen Feyissa on Unsplash)

When it comes to assessing student writing, AI is getting a passing grade. 

At least, that’s what two recent studies comparing AI-generated assessments with human ones suggest. In one study, researchers compared the written feedback ChatGPT gave with feedback from human teachers on 200 source-based argument essays in history, written by students in grades 6-12 across 26 classrooms in two Southern California school districts. In the other study, student papers were given a numeric score by both teachers and various versions of ChatGPT, and researchers examined how consistently the different humans and versions of ChatGPT graded.

In the written feedback study, a cohort of skilled human teachers performed marginally better than their AI counterpart, while in the scoring study, the AI slightly outperformed the humans.

Steve Graham, a co-author of both studies and a professor at Arizona State University, says that while the humans did better in the written feedback study and AI arguably did a little better in the study in which only a score was given, overall AI and humans were pretty close.

Below, he discusses the findings, limitations, and implications of both studies in more detail.

Human Vs. AI Writing Assessment: What The Research Found  

For the written assessment study, Graham and his co-authors measured how humans performed vs. ChatGPT across five components of feedback:

  • Whether the feedback was criteria-based.
  • Whether the feedback offered clear directions.
  • Whether the feedback was accurate.
  • Whether the feedback prioritized essential features (to avoid overwhelming a student with too much feedback).
  • Whether the feedback was given with a supportive tone.

The humans outperformed the AI on all these criteria but one. “AI actually outperformed the humans on [assessing] how well students took information from source materials,” Graham says. 

And even though the human teachers performed better in the study overall, Graham stresses that ChatGPT might fare better against a random group of teachers than it did against these experts. “In this study, we had what you could think of as expert feedback givers,” he says, adding that the teachers in this study represented, “if not the best-case scenario, a very good-case scenario.”

In the other study, which looked at how consistent ChatGPT’s paper scores were compared with humans’, the chatbot actually performed better than the humans, but there was still inconsistency in grading across different generations of ChatGPT technology.

“When you compared a couple of different AI systems, AI was a bit more reliable than humans were,” Graham says. “Humans, which we often consider this gold standard, had about a 43% match, and AIs were somewhere between 50 and 82%.” 

However, this research highlighted the problems with inconsistency in the grading of written work overall. Neither humans nor machines, Graham says, were as consistent with grading “as we might have liked.”
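
For a sense of what those figures measure, here is a toy sketch in Python, assuming “match” means two raters assigning a paper the exact same score; the article doesn’t spell out the study’s agreement metric, and the scores below are invented for illustration.

```python
# A toy sketch of the kind of consistency comparison the scoring study ran,
# assuming a "match" means two raters giving the exact same score.
# The scores below are made up for illustration, not the study's data.
def exact_match_rate(scores_a: list[int], scores_b: list[int]) -> float:
    """Fraction of papers on which two raters gave identical scores."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

rater_1 = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3]
rater_2 = [3, 3, 4, 2, 4, 2, 5, 3, 2, 4]

print(f"Exact agreement: {exact_match_rate(rater_1, rater_2):.0%}")
# -> Exact agreement: 50%
```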

Takeaways For Teachers From The Research  

Overall, these papers suggest that there may be a role for AI in writing assessment going forward, perhaps as a tool students can use to improve their work before submitting papers, and possibly as a time-saving tool for teachers. Providing more feedback is one of the best ways to improve student writing, Graham says, but busy teachers with large class sizes often don’t have the time to increase the number of assessments they can provide.

However, there are some limitations to the current research. For the study on written assessment, Graham and his colleagues used specifically designed prompts that were tested and vetted by experts in using technology for writing instruction. Therefore, these were stronger prompts than classroom teachers might be expected to write for AI.
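
As a rough illustration, here is a minimal sketch, assuming the OpenAI Python client, of how a criteria-based feedback prompt might be wired up. The rubric wording, model name, and feedback_for helper are hypothetical, not the researchers’ vetted prompts.

```python
# A minimal sketch of a criteria-based feedback request, assuming the
# OpenAI Python client. The rubric wording and model name are illustrative;
# the study's expert-vetted prompts are not reproduced in this article.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

RUBRIC_PROMPT = """You are giving feedback on a source-based argument essay.
Assess the essay against these criteria:
1. Does the essay state a clear claim supported by the sources?
2. Is evidence from the source materials used accurately?
Prioritize the two or three most essential issues so the student is not
overwhelmed, give clear directions for revision, and keep a supportive tone.
"""

def feedback_for(essay_text: str) -> str:
    """Ask the model for rubric-based feedback on one student essay."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the studies used earlier ChatGPT versions
        messages=[
            {"role": "system", "content": RUBRIC_PROMPT},
            {"role": "user", "content": essay_text},
        ],
    )
    return response.choices[0].message.content
```

The design point, such as it is, is that the grading criteria travel with every request rather than being left to the model; the prompts in the study encoded far more detailed, expert-tested criteria than this sketch.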

On the other hand, at the pace at which AI technology is improving, the conversation around AI grading may be different in the not-too-distant future. “My guess will be that AI will get better at giving feedback over time,” Graham says. 

But implementing the technology in schools, both in its current form and as it presumably improves in the future, requires more training and a greater emphasis on digital literacy for both students and teachers, Graham says.

“If we're going to see it used in the classroom a lot, we still got a long way to go. Teachers need to become more confident and familiar with AI,” he says. “We need to resolve issues about unsanctioned use and ethical concerns, and we have to think really carefully about how we're going to be putting AI into play.”  

Erik Ofgang

Erik Ofgang is a Tech & Learning contributor. A journalist, author, and educator, his work has appeared in The New York Times, The Washington Post, Smithsonian, The Atlantic, and the Associated Press. He currently teaches in Western Connecticut State University’s MFA program. While a staff writer at Connecticut Magazine, he won a Society of Professional Journalists award for his education reporting. He is interested in how humans learn and how technology can make that learning more effective.