Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach
This research investigates whether Large Language Models (LLMs) can reliably grade student responses in Linux/bash command-line examinations. As student enrollment in computer science grows, manual grading has become increasingly difficult, and traditional automated systems often fail to account for the nuances of command-line syntax, such as multiple correct ways to solve a problem or the need for partial credit. The study evaluates four leading AI models to see if they can match the judgment of expert human instructors when using a structured, taxonomy-based grading framework.
A Four-Level Grading Framework
To evaluate the complexity of command-line tasks, the researchers implemented a "CogTax" framework. This taxonomy categorizes questions based on two factors: cognitive complexity (the mental effort required) and operational impact (the potential for the command to change the system). The four levels are:
Level 1 (Information Query): Basic read-only commands like ls or cat.
Level 2 (Basic Modifications): Commands that create or move files, such as mkdir or cp.
Level 3 (Structural Understanding): Tasks involving pipelines, permissions, or conditional logic.
Level 4 (Advanced System Management): Complex, multi-step operations like process management.
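As a rough illustration, the four levels can be written down as a small data structure. This is a minimal sketch, not the study's implementation: the names `CogTaxLevel`, `EXAMPLE_COMMANDS`, and `lookup_level` are hypothetical, and the lookup covers only the example commands the summary names for Levels 1 and 2.

```python
from enum import IntEnum
from typing import Optional


class CogTaxLevel(IntEnum):
    """The four levels described above (class and member names are illustrative)."""
    INFORMATION_QUERY = 1
    BASIC_MODIFICATION = 2
    STRUCTURAL_UNDERSTANDING = 3
    ADVANCED_SYSTEM_MANAGEMENT = 4


# Only the example commands mentioned in the text; Levels 3 and 4 are
# described by task type, not by specific commands, so they are not listed.
EXAMPLE_COMMANDS = {
    "ls": CogTaxLevel.INFORMATION_QUERY,
    "cat": CogTaxLevel.INFORMATION_QUERY,
    "mkdir": CogTaxLevel.BASIC_MODIFICATION,
    "cp": CogTaxLevel.BASIC_MODIFICATION,
}


def lookup_level(command_line: str) -> Optional[CogTaxLevel]:
    """Return the level of the first word if it is a known example, else None."""
    parts = command_line.split()
    if not parts:
        return None
    return EXAMPLE_COMMANDS.get(parts[0])
```

In the framework itself the level is assigned to a question based on cognitive complexity and operational impact, so a lookup by command name like this one is only a convenient way to show the categories.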
Testing AI Against Human Experts
The researchers tested four models (GPT, Claude Opus, Gemini, and GLM) on 1,200 real student responses from a second-year Computer Engineering course. Three human instructors graded each response independently to create a reliable baseline. Each model was run with two prompting strategies: a minimal baseline prompt and a rubric-enhanced prompt that provided specific scoring guidelines.
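The difference between the two prompting strategies can be sketched as follows. The prompt wording, the function name `build_prompt`, and its parameters are assumptions made for illustration; the summary does not reproduce the prompts actually used in the study.

```python
from typing import Optional, Sequence


def build_prompt(
    question: str,
    response: str,
    max_score: int,
    rubric: Optional[Sequence[str]] = None,
) -> str:
    """Build a grading prompt; passing a rubric gives the rubric-enhanced variant."""
    lines = [
        "You are grading a Linux/bash exam answer.",
        f"Question: {question}",
        f"Student response: {response}",
    ]
    if rubric:
        # Rubric-enhanced strategy: append explicit scoring guidelines.
        lines.append("Rubric:")
        lines.extend(f"- {item}" for item in rubric)
    lines.append(f"Reply with a single integer score from 0 to {max_score}.")
    return "\n".join(lines)


# Minimal baseline: no rubric supplied.
minimal = build_prompt("List all files in the current directory.", "ls -a", 5)

# Rubric-enhanced: hypothetical guidelines covering equivalent solutions
# and partial credit, two issues the study highlights.
enhanced = build_prompt(
    "List all files in the current directory.",
    "ls -a",
    5,
    rubric=[
        "Accept any command that produces the required output.",
        "Give partial credit when the command is right but a flag is missing.",
    ],
)
```

Keeping both variants in one function makes it easy to run the same response through each strategy and compare the scores.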
Key Findings
The study found that Gemini 3.0 Pro, when paired with a detailed, rubric-guided prompt, achieved the highest level of agreement with human instructors. A significant takeaway is that the quality of the rubric provided to the AI had a much larger impact on grading accuracy than the choice of the AI model itself. However, the research also observed that as the complexity of the command-line task increased (moving from Level 1 to Level 4), the agreement between the AI and human graders consistently declined.
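One simple way to examine how agreement changes by level is to compare each model score with a reference score derived from the three instructors. The sketch below assumes exact-match agreement against the median instructor score; the summary does not state which agreement metric or aggregation the study used, and the records shown are toy values, not the study's data.

```python
from collections import defaultdict
from statistics import median


def agreement_by_level(records):
    """Return the exact-match agreement rate per level.

    records: iterable of (level, human_scores, model_score), where
    human_scores holds the three instructor scores for one response.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, human_scores, model_score in records:
        reference = median(human_scores)  # assumed aggregation of the three graders
        totals[level] += 1
        if model_score == reference:
            hits[level] += 1
    return {level: hits[level] / totals[level] for level in sorted(totals)}


# Toy records for illustration only.
toy_records = [
    (1, (5, 5, 5), 5),
    (1, (4, 5, 5), 5),
    (4, (2, 3, 3), 3),
    (4, (1, 2, 2), 4),
]
print(agreement_by_level(toy_records))  # {1: 1.0, 4: 0.5}
```

A chance-corrected statistic would be a stricter choice than raw exact-match agreement; the plain rate is used here only to keep the example short.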
Practical Implications
The results suggest that while LLMs are effective at grading simpler, lower-level command-line tasks, they struggle more with advanced, high-complexity operations. This indicates that AI can serve as a powerful tool for scaling education, but it should be used within a hybrid framework. By using the taxonomy-based approach, educators can identify which questions are suitable for automated grading and which ones require human oversight, ensuring that students receive both efficient feedback and accurate, fair assessment.
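The hybrid idea can be sketched as a routing rule keyed on the taxonomy level. The cutoff `max_auto_level=2` and the names below are hypothetical; the summary does not say at which level human oversight should begin, so the threshold would need to be set from measured agreement in a given course.

```python
def route_question(level: int, max_auto_level: int = 2) -> str:
    """Decide who grades a question based on its taxonomy level.

    max_auto_level is an assumed cutoff, not a value reported by the study.
    """
    if not 1 <= level <= 4:
        raise ValueError(f"level must be between 1 and 4, got {level}")
    return "llm" if level <= max_auto_level else "human_review"


# Example: with the default cutoff, Levels 1-2 go to the model
# and Levels 3-4 are flagged for an instructor.
routing = {level: route_question(level) for level in range(1, 5)}
```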