E-Learning Team Blog

Marking with AI: a quick and dirty experiment

Enthusiastic thought leaders and numerous tech bros on Twitter/X are keen to encourage and normalise the offloading of work to Generative Artificial Intelligence (Gen AI). But is there a risk of automating the wrong activities and thereby failing to realise the potential that such technology promises? Author Joanna Maciejewska summarises her concerns thus:

“I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes.”
https://x.com/AuthorJMac/status/1773679197631701238

As marking is both a time-consuming and time-sensitive activity, it might be a tempting place to start. But is automating the marking and feedback process through the use of Gen AI really a good idea? The value of quality feedback to students is clear to many, including Advance HE:

Empowering and engaging learners through assessment design and providing opportunities for dialogic feedback is central to learning and the student experience.
https://www.advance-he.ac.uk/teaching-and-learning/assessment-and-feedback-higher-education

As an institution with ambitions to deliver a high-quality learning experience and reduce disparities in student outcomes, marking and feedback are clearly our ‘art and writing’. So, one argument is that Gen AI could take care of the heavy lifting of marking papers, freeing up time for teachers to offer a more personal and conversational approach to feedback. A counter-argument is that feedback, something so human and important, should not be entrusted to technology, especially one that presents so many known challenges.

Limitations of Gen AI

The known limitations of Gen AI are well documented and discussed. Using marking and feedback as a lens through which to examine these highlights a number of intersecting issues:

With these limitations in mind, and a need to gauge the performance of Gen AI in this emotive area, I designed and executed a very quick and dirty study to experience first-hand the issues arising from the use of Gen AI to assess students’ work. The details of this are as follows:

Results

Below is a table showing the scores generated for each submission by the three Gen AI services.

Student name*        ChatGPT 4o   ChatGPT 3.5   Copilot#   Mean   Standard deviation
Duriel Frimpong      96           95            100        97.0   2.6
Nisan Ahmad          95           91            100        95.3   4.5
Hao Haidong          93           95            90         92.7   2.5
Robert Holdsworth    95           100           100        98.3   2.9
Anonymous            90           97            80         89.0   8.5
Mean                 93.8         95.6          94.0
Standard deviation   2.4          3.3           8.9

* All names are fictitious
# Copilot uses ChatGPT 4
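The per-paper figures in the table above can be reproduced with a few lines of Python. This is a sketch, not part of the original study; it assumes the standard deviations quoted are sample standard deviations (n − 1 denominator), which is what the table values are consistent with.

```python
from statistics import mean, stdev

# Scores awarded to each paper by ChatGPT 4o, ChatGPT 3.5 and Copilot respectively
scores = {
    "Duriel Frimpong":   [96, 95, 100],
    "Nisan Ahmad":       [95, 91, 100],
    "Hao Haidong":       [93, 95, 90],
    "Robert Holdsworth": [95, 100, 100],
    "Anonymous":         [90, 97, 80],
}

for name, marks in scores.items():
    # statistics.stdev uses the sample (n - 1) formula, matching the table
    print(f"{name}: mean={mean(marks):.1f}, sd={stdev(marks):.1f}")
```

Running the same calculation down each column (one service's scores across all five papers) gives the column means and standard deviations in the table's final two rows.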

Inconsistency: The results from what was a tiny study were wildly varied. Eight different scores, ranging from 80 to 100, were generated for what – authors’ names aside – were 15 copies of the same content.

The most advanced model, ChatGPT 4o, provided the most consistent results, as evidenced by the lowest standard deviation of the three models. Copilot, which uses ChatGPT 4, was the most inconsistent model as indicated by its standard deviation, yet it was the only one to give the same score, 100, for three out of the five papers submitted to it.

The most inconsistent scores were achieved by the papers submitted to Copilot and those submitted anonymously. Our anonymous marking policy, coupled with the likely adoption of Copilot (Web), may well mitigate potential data protection issues, but at the cost of great unfairness to students should their work be uploaded to it for marking and feedback.

Illusion of criticality: Copilot awarded slightly lower average scores than ChatGPT 3.5, which to the untutored may suggest the faculty of criticality, and degrees thereof, among different Gen AI models. ChatGPT 4o produced the lowest average scores, doing little to dispel the false belief that the more advanced Gen AI models possess critical skills.

Bias: The lowest average score was awarded to the anonymous papers, which may suggest that the inclusion of a student name played a part. That the highest average score was achieved by papers associated with the most recognisably anglophone name, while papers submitted with a Sinophone student name were consistently awarded the lowest scores among named students, is deeply worrying regardless of scale.

Limitations of the study

Conclusions

This is not the first or last study into the use of Gen AI for marking and feedback, but the results are consistent with other similar small-scale studies, and with research conducted into the use of Gen AI content-detection systems.

While Gen AI can produce credible rubrics, using it to mark students’ work presents data protection, ethical and quality issues, the latter arising from the limitations, biases and inaccuracies present in the underlying datasets. Moreover, the output of Gen AI tools is based on probability and is largely irreproducible. At best, it seems that marking with Gen AI is akin to rolling a loaded die. Off-the-shelf Gen AI services should not be used to mark student work.
