Grading can be a time-consuming task for many teachers. Artificial intelligence tools may help ease the strain, according to a new study from the University of Georgia.
Many states have adopted the Next Generation Science Standards, which emphasize the importance of argumentation, investigation and data analysis. But teachers following the curriculum face challenges when it’s time to grade students’ work.
“Asking kids to draw a model, to write an explanation, to argue with each other are very complex tasks,” said Xiaoming Zhai, corresponding author of the study and an associate professor and director of the AI4STEM Education Center in UGA’s College of Education. “Teachers often don’t have enough time to score all the students’ responses, which means students will not be able to receive timely feedback.”
AI grades quickly but relies on shortcuts
The study explored how large language models (LLMs) grade students’ work compared with human graders. LLMs are a type of AI trained on large amounts of data, usually drawn from the internet, which they use to “understand” and generate human language.
For the study, the LLM Mixtral was presented with written responses from middle school students. One question asked students to create a model showing what happens to particles when heat energy is transferred to them. A correct answer would indicate that particles move more slowly when cold and faster when hot.
Mixtral then constructed rubrics to assess student performance and assign final scores.
"We still have a long way to go when it comes to using AI, and we still need to figure out which direction to go in.” —Xiaoming Zhai, College of Education
The researchers found that LLMs could grade responses quickly, but they often took shortcuts, such as spotting certain keywords and assuming a student understood the topic. This, in turn, lowered their accuracy when assessing students’ grasp of the material.
The study suggests that LLMs could be improved by providing them with rubrics that capture the deep, analytical thinking humans use when grading. These rubrics should include specific rules on what the grader is looking for in a student’s response. The LLM could then evaluate the answer against the rules the human set.
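To make that concrete, here is a minimal sketch of what rubric-guided grading could look like in code. It assumes a hypothetical call_llm() helper standing in for any LLM client (for example, one wrapping Mixtral), and the rubric wording is illustrative rather than taken from the study.

```python
# A minimal sketch of rubric-guided LLM grading. call_llm() is a
# hypothetical placeholder for any LLM client; the rubric text is
# illustrative, not the study's actual rubric.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; replace with an actual client."""
    return "[model's graded evaluation would appear here]"

HUMAN_RUBRIC = """\
Award 1 point for each rule the response satisfies:
1. States that particles move faster when heat energy is added.
2. States that particles move more slowly when the material is cold.
3. Ties the change in particle motion to the transfer of heat energy,
   not merely to the appearance of the word "temperature".
"""

def grade_response(student_answer: str) -> str:
    # Embedding the human-made rubric gives the model explicit rules
    # to check, rather than leaving it to its own keyword shortcuts.
    prompt = (
        "You are grading a middle school science response.\n"
        f"Rubric:\n{HUMAN_RUBRIC}\n"
        f"Student response:\n{student_answer}\n"
        "For each rule, quote supporting evidence from the response or "
        "say the evidence is missing, then give a total score."
    )
    return call_llm(prompt)

print(grade_response("When heat is added, the particles speed up."))
```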
“The train has left the station, but it has just left the station,” said Zhai. “It means we still have a long way to go when it comes to using AI, and we still need to figure out which direction to go in.”
LLMs and human graders differ in their scoring process
Traditionally, LLMs used for scoring are trained on both the students’ answers and the scores human graders assigned. In this study, however, the LLMs were instructed to generate their own rubric to evaluate student responses.
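That two-stage setup might look roughly like the sketch below, under the same assumptions as the earlier snippet: call_llm() is a hypothetical placeholder and the prompt wording is illustrative.

```python
# Sketch of the two-stage setup described above: the model first drafts
# its own rubric for the question, then applies that rubric to each
# student response. call_llm() and the prompts are placeholders.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; replace with an actual client."""
    return "[model output would appear here]"

QUESTION = ("Create a model showing what happens to particles when "
            "heat energy is transferred to them.")

def generate_rubric(question: str) -> str:
    # Stage 1: the LLM drafts its own scoring rubric.
    return call_llm(f"Write a scoring rubric for this science question:\n{question}")

def score_with_rubric(rubric: str, answer: str) -> str:
    # Stage 2: the LLM applies its rubric to a student response.
    return call_llm(
        f"Rubric:\n{rubric}\n\nStudent response:\n{answer}\n\n"
        "Score the response against the rubric and explain your reasoning."
    )

rubric = generate_rubric(QUESTION)
print(score_with_rubric(rubric, "Cold particles move slow; hot ones move fast."))
```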
The researchers found that the rubrics generated by LLMs had some similarities with those made by humans. LLMs generally understand what the question is asking of students, but they don’t have the ability to reason like humans do.
Instead, LLMs rely mostly on shortcuts, such as what Zhai referred to as “over-inferring.” This is when an LLM assumes a student understands something when a human teacher wouldn’t.
For example, LLMs will mark a student’s response as correct if it includes certain keywords but can’t evaluate the logic the student is using.
“Students could mention a temperature increase, and the large language model interprets that all students understand the particles are moving faster when temperatures rise,” said Zhai. “But based upon the student writing, as a human, we’re not able to infer whether the students know whether the particles will move faster or not.”
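The shortcut is easy to see in a deliberately naive sketch. The keyword set below is hypothetical; it simply shows how matching trigger words can credit a response that never states the particle-motion idea:

```python
# Illustrative sketch of the keyword shortcut: a scorer that only
# checks for trigger words credits a response that never mentions
# particle motion at all. The keyword set is hypothetical.

KEYWORDS = {"temperature", "increase"}

def keyword_score(response: str) -> bool:
    # Marks a response "correct" if every trigger word appears,
    # with no check of the underlying reasoning.
    return KEYWORDS.issubset(response.lower().split())

# Mentions a temperature increase but says nothing about particles
# moving faster, yet the keyword check still passes ("over-inferring").
print(keyword_score("the temperature will increase when heat is added"))  # True
```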
LLMs are especially reliant on shortcuts when they are given examples of graded responses without explanations of why each response earned the grade it did.
Humans still have a role in automated scoring
Despite the speed of LLMs, the researchers warn against replacing human graders completely.
Human-made rubrics often have a set of rules that reflect what the instructor expects of student responses. Without such rubrics, LLMs have only a 33.5% accuracy rate. When the AI has access to human-made rubrics, that accuracy jumps to just over 50%.
If the accuracy of LLMs can be improved further, though, educators may be open to using the technology to streamline their grading processes.
“Many teachers told me, ‘I had to spend my weekend giving feedback, but by using automatic scoring, I do not have to do that. Now, I have more time to focus on more meaningful work instead of some labor-intensive work,’” said Zhai. “That’s very encouraging for me.”
The study was published in Technology, Knowledge and Learning and was co-authored by Xuansheng Wu, Padmaja Pravin Saraf, Gyeonggeon Lee, Eshan Latif and Ninghao Liu.