E-Learning Team Blog

Marking with AI: a quick and dirty experiment

Enthusiastic thought leaders and numerous tech bros on Twitter/X are keen to encourage and normalise the offloading of work to Generative Artificial Intelligence (Gen AI). But is there a risk of automating the wrong activities and thereby failing to realise the potential that such technology promises? Author Joanna Maciejewska summarises her concerns thus:

“I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes.”
https://x.com/AuthorJMac/status/1773679197631701238

As marking is both a time-consuming and time-sensitive activity, it might be a tempting place to start. But is automating the marking and feedback process through the use of Gen AI really a good idea? The value of quality feedback to students is clear to many, including Advance HE:

Empowering and engaging learners through assessment design and providing opportunities for dialogic feedback is central to learning and the student experience.
https://www.advance-he.ac.uk/teaching-and-learning/assessment-and-feedback-higher-education

As an institution with ambitions to deliver a high-quality learning experience and reduce disparities in student outcomes, marking and feedback are clearly our ‘art and writing’. So, one argument is that Gen AI could take care of the heavy lifting of marking papers, freeing up time for teachers to offer a more personal and conversational approach to feedback. A counter-argument is that feedback, something so human and important, should not be entrusted to technology, especially one that presents so many known challenges.

Limitations of Gen AI

The known limitations of Gen AI are well documented and discussed. Using marking and feedback as a lens through which to examine these highlights a number of intersecting issues:

With these limitations in mind, and a need to gauge the performance of Gen AI in this emotive area, I designed and executed a very quick and dirty study to experience first-hand the issues arising from the use of Gen AI to assess students’ work. The details of this are as follows:

Results

Below is a table showing the scores generated for each submission by the three Gen AI services.

Student name*        ChatGPT 4o   ChatGPT 3.5   Copilot#   Mean   Standard deviation
Duriel Frimpong      96           95            100        97.0   2.6
Nisan Ahmad          95           91            100        95.3   4.5
Hao Haidong          93           95            90         92.7   2.5
Robert Holdsworth    95           100           100        98.3   2.9
Anonymous            90           97            80         89.0   8.5
Mean                 93.8         95.6          94.0
Standard deviation   2.4          3.3           8.9

* All names are fictitious
# Copilot uses ChatGPT 4
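The per-paper figures in the table above can be reproduced with a few lines of Python. This is a sketch, not part of the original study; it assumes the standard deviations quoted are sample standard deviations (n − 1 denominator), which is what the table values are consistent with.

```python
from statistics import mean, stdev

# Scores awarded to each paper by ChatGPT 4o, ChatGPT 3.5 and Copilot respectively
scores = {
    "Duriel Frimpong":   [96, 95, 100],
    "Nisan Ahmad":       [95, 91, 100],
    "Hao Haidong":       [93, 95, 90],
    "Robert Holdsworth": [95, 100, 100],
    "Anonymous":         [90, 97, 80],
}

for name, marks in scores.items():
    # statistics.stdev uses the sample (n - 1) formula, matching the table
    print(f"{name}: mean={mean(marks):.1f}, sd={stdev(marks):.1f}")
```

Running the same calculation down each column (one service's scores across all five papers) gives the column means and standard deviations in the table's final two rows.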

Inconsistency: The results from what was a tiny study were wildly varied. Eight different scores, ranging from 80 to 100, were generated for what – authors’ names aside – were 15 copies of the same content.

The most advanced model, ChatGPT 4o, provided the most consistent results, as evidenced by the lowest standard deviation of the three models. Copilot, which uses ChatGPT 4, was the most inconsistent model as indicated by its standard deviation, yet it was the only one to give the same score, 100, for three out of the five papers submitted to it.

The most inconsistent scores were achieved by the papers submitted to Copilot and those submitted anonymously. Our anonymous marking policy, coupled with the likely adoption of Copilot (Web), may well mitigate potential data protection issues, but at the cost of great unfairness to students should their work be uploaded to it for marking and feedback.

Illusion of criticality: Copilot awarded slightly lower average scores than ChatGPT 3.5, which to the untutored may suggest the faculty of criticality, and degrees thereof, among different Gen AI models. ChatGPT 4o produced the lowest average scores, doing little to dispel the false belief that the more advanced Gen AI models possess critical skills.

Bias: The lowest average score was awarded to the anonymous papers, which may suggest that the inclusion of a student name played a part. That the highest average score was achieved by papers associated with the most recognisably anglophone name, while papers submitted with a Sinophone student name were consistently awarded the lowest scores among named students, is deeply worrying regardless of scale.

Limitations of the study

Conclusions

This is not the first or last study into the use of Gen AI for marking and feedback, but the results are consistent with other similar small-scale studies, and with research conducted into the use of Gen AI content-detection systems.

While Gen AI can produce credible rubrics, using it to mark students’ work presents data protection, ethical and quality issues, the latter arising from the limitations, biases and inaccuracies present in the underlying datasets. Moreover, the output of Gen AI tools is based on probability and is largely irreproducible. At best, it seems that marking with Gen AI is akin to rolling a loaded die. Off-the-shelf Gen AI services should not be used to mark student work.
