High School Math Students Used A GPT-4 AI Tutor. They Did Worse.

An AI math tutor
(Image credit: Image by Vicki Hamilton from Pixabay)

Researchers at the University of Pennsylvania recently studied the impact of GPT-4-powered tutors on nearly a thousand high school students. The results suggest the GPT-4 tutors might themselves need additional assistance when it comes to helping students.

Students who had access to an AI tutor during practice sessions scored better on practice exams than students without access. However, on a subsequent exam, when none of the students had access to an AI tutor, those who had worked with an AI tutor did worse than their peers.

“Generative AI Can Harm Learning,” the paper summarizing these findings, was recently published by the Wharton School of the University of Pennsylvania.

“The one-sentence punchline summary of the paper is, 'We find that generative AI could hurt learning because students potentially use it as an answer machine, as opposed to a tool that is conducive for learning,'” says Alp Sungu, one of the paper’s co-authors and a professor at the Wharton School.

However, Sungu and his co-authors stress that they are not anti-AI tutors, and believe AI tutors ultimately can be helpful in certain contexts if designed correctly.

AI GPT Tutors: How The Study Was Designed

To study the impact of AI tutors on math students, Sungu and his colleagues used separate prompts to create two different GPT-4-powered tutors. The first, which they called “GPT Base,” worked like a standard version of ChatGPT: given a math problem, it would reveal the answer while helping students. The second was built with more advanced prompts that instructed the AI not to give away the answer but to guide students toward finding it on their own. The researchers called this “GPT Tutor.”

The researchers then worked with nearly a thousand students in Turkey who were in grades 9, 10, and 11 in the 2023-24 school year. The pre-registered randomized controlled trial placed students into three groups: one with no AI tutor, one that used GPT Base, and one that used the more advanced GPT Tutor. After a lesson from a teacher, all students in the study took a practice exam. Those with access to GPT Base scored 48% better than students without an AI tutor, while those with GPT Tutor scored 127% better.

However, on the actual test, the GPT Base students did 17% worse than the control group, while the GPT Tutor group performed about the same as the control group on average. The bright side was that GPT Tutor seemed to mitigate the negative impact of an AI tutor on students, though it didn’t help them either.

“My thinking was GPT Tutor would be better than the control group,” Sungu says. “That was not the case.” However, with better prompting and improvements to the tutor in the future, he believes it could ultimately help student learning.

What Are The Takeaways For Teachers?

The study highlights some of the differences between AI use in educational settings versus professional ones.

Coders and others who use AI professionally tend to accomplish more with the help of AI. “If you give them a task, they will deliver the task more efficiently, more effectively, they become more productive,” Sungu says. “There's already a lot of literature that shows that.”

However, in education, teachers are interested not just in the immediate output but in what the student has actually learned. For instance, students might be able to write higher-quality papers by prompting GPT more skillfully — but that doesn’t mean they’ll learn more, or anything at all, about writing.

This is why Sungu believes that for an AI tutor to become more effective, it will have to focus on learning rather than productivity. That requires more research specifically on AI in education. Otherwise, Sungu believes, even as AI grows more advanced, the inherent challenges surrounding AI and learning will remain. “The technology is already good at providing answers,” Sungu says. “We need to think about the design of assessment and educational delivery.”

Some might argue that since AI is available to everyone now, testing student achievement without the use of AI might no longer matter. Sungu understands that line of reasoning, and notes that there are certain technologies that make some skills obsolete, such as the calculator. “I really don't care if you cannot multiply 7 and 8 off the top of your head, because we do trust calculators,” he says.

On the other hand, there are other skills, such as critical thinking and problem-solving, that are vital for students to develop without relying on machines. “The example that we give in the paper is the Federal Aviation Administration banning junior pilots from completely depending on autopilot,” Sungu says. “When autopilot is inactive, we still need people to think for themselves.”

Erik Ofgang

Erik Ofgang is a Tech & Learning contributor. A journalist, author and educator, his work has appeared in The New York Times, The Washington Post, Smithsonian, The Atlantic, and the Associated Press. He currently teaches in Western Connecticut State University’s MFA program. While a staff writer at Connecticut Magazine, he won a Society of Professional Journalists award for his education reporting. He is interested in how humans learn and how technology can make that more effective.