Applying Large Language Models to Enhance the Assessment of Java Programming Assignments
The assessment of programming assignments in computer science (CS) education traditionally relies on manual grading, which strives to provide comprehensive feedback on correctness, style, efficiency, and other software quality attributes. As class sizes increase, however, providing detailed feedback consistently becomes difficult, especially when multiple assessors are required to handle a larger number of submissions. Large Language Models (LLMs) such as ChatGPT offer a promising alternative for automating this process in a consistent, scalable, and fair manner.
This paper explores the efficacy of ChatGPT-4 and other popular LLMs in automating programming assignment evaluation. We conduct a series of studies within multiple Java-based CS courses, comparing LLM-generated assessments to those produced by human graders. The analysis focuses on key metrics, including accuracy, precision, recall, efficiency, and consistency, in identifying programming mistakes against predefined rubrics. Our findings demonstrate that, with appropriate prompt engineering and feature selection, LLMs improve grading objectivity and efficiency, serving as a valuable complementary tool to human graders in undergraduate and graduate CS education.
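As a minimal sketch of how such a comparison can be scored, the Java snippet below treats the human grader's flagged rubric violations as ground truth and the LLM's flagged violations as predictions, then computes standard precision and recall. The rubric-item names and class are hypothetical illustrations, not the paper's actual rubrics or tooling.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: scoring LLM-flagged rubric violations against
// human-graded ground truth for one submission (hypothetical rubric IDs).
public class RubricAgreement {

    static double precision(Set<String> llmFlags, Set<String> humanFlags) {
        if (llmFlags.isEmpty()) return 0.0;
        Set<String> truePositives = new HashSet<>(llmFlags);
        truePositives.retainAll(humanFlags);          // mistakes flagged by both
        return (double) truePositives.size() / llmFlags.size();
    }

    static double recall(Set<String> llmFlags, Set<String> humanFlags) {
        if (humanFlags.isEmpty()) return 0.0;
        Set<String> truePositives = new HashSet<>(llmFlags);
        truePositives.retainAll(humanFlags);          // mistakes flagged by both
        return (double) truePositives.size() / humanFlags.size();
    }

    public static void main(String[] args) {
        // Hypothetical rubric-item IDs; real IDs would come from the course rubric.
        Set<String> humanFlags = Set.of("missing-null-check", "magic-number", "no-javadoc");
        Set<String> llmFlags   = Set.of("missing-null-check", "magic-number", "unused-import");

        System.out.printf("precision = %.2f%n", precision(llmFlags, humanFlags)); // 0.67
        System.out.printf("recall    = %.2f%n", recall(llmFlags, humanFlags));    // 0.67
    }
}
```

Aggregating these per-submission scores across a course would then yield the kind of accuracy, precision, and recall figures the study reports when comparing LLM and human assessments.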