Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

Key Takeaways

  • This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses.
  • The models were tested with two prompt variants, a minimal baseline and a rubric-enhanced version, on 1200 real responses from second-year Computer Engineering students independently graded by three expert instructors.
  • Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014).
  • Agreement declined consistently as taxonomy level increased, with the largest discrepancies at higher levels.
Paper Abstract

Scalable and reliable grading of command-line examinations remains a challenge in computing education, where rising enrolments make manual marking difficult and rule-based autograders cannot handle partial credit, equivalent solutions, or syntactic variation. This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses. The study adopts a four-level cognitive taxonomy that combines cognitive complexity and operational impact, ranging from information retrieval (L1) and basic file manipulation (L2) to structural operations (L3) and advanced system management (L4). The models were tested with two prompt variants, a minimal baseline and a rubric-enhanced version, on 1200 real responses from second-year Computer Engineering students independently graded by three expert instructors. Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014). Agreement declined consistently as taxonomy level increased, with the largest discrepancies at higher levels. Across all models, rubric quality had a larger effect than provider choice, with structured prompts consistently improving agreement. These results show that question complexity is a reliable predictor of the difficulty LLMs face in grading accurately, and they establish a principled, taxonomy-based framework for determining which questions are suitable for AI-assisted grading and which require human review, while also providing a transferable evaluation protocol and prompt templates.

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach
This research investigates whether Large Language Models (LLMs) can reliably grade student responses in Linux/bash command-line examinations. As student enrollment in computer science grows, manual grading has become increasingly difficult, and traditional automated systems often fail to account for the nuances of command-line syntax, such as multiple correct ways to solve a problem or the need for partial credit. The study evaluates four leading AI models to see if they can match the judgment of expert human instructors when using a structured, taxonomy-based grading framework.

A Four-Level Grading Framework

To evaluate the complexity of command-line tasks, the researchers implemented a "CogTax" framework. This taxonomy categorizes questions based on two factors: cognitive complexity (the mental effort required) and operational impact (the potential for the command to change the system). The four levels are:

  • Level 1 (Information Query): Basic read-only commands like ls or cat.

  • Level 2 (Basic Modifications): Commands that create or move files, such as mkdir or cp.

  • Level 3 (Structural Understanding): Tasks involving pipelines, permissions, or conditional logic.

  • Level 4 (Advanced System Management): Complex, multi-step operations like process management.
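
To make the four levels concrete, here is a minimal sketch that encodes them as a Python enum together with the example commands named in the list above. The names `TaxonomyLevel`, `EXAMPLE_COMMANDS`, and `example_level` are hypothetical and do not come from the paper. The study assigns levels to exam questions, not to individual commands, so the lookup below only illustrates the mapping and is not a classifier.

```python
from enum import IntEnum
from typing import Optional


class TaxonomyLevel(IntEnum):
    """The four levels from the summary; member names are illustrative."""
    INFORMATION_QUERY = 1
    BASIC_MODIFICATION = 2
    STRUCTURAL_UNDERSTANDING = 3
    ADVANCED_SYSTEM_MANAGEMENT = 4


# Example commands taken from the level descriptions above. The summary
# names concrete commands only for Levels 1 and 2, so only those appear.
EXAMPLE_COMMANDS = {
    "ls": TaxonomyLevel.INFORMATION_QUERY,
    "cat": TaxonomyLevel.INFORMATION_QUERY,
    "mkdir": TaxonomyLevel.BASIC_MODIFICATION,
    "cp": TaxonomyLevel.BASIC_MODIFICATION,
}


def example_level(command_line: str) -> Optional[TaxonomyLevel]:
    """Return the level of a known example command, or None if unknown.

    Hypothetical helper: it inspects only the first word, so it ignores
    pipelines, permissions, and multi-step logic (Levels 3 and 4).
    """
    words = command_line.split()
    if not words:
        return None
    return EXAMPLE_COMMANDS.get(words[0])
```

Because `IntEnum` members compare as integers, a check such as `level <= TaxonomyLevel.BASIC_MODIFICATION` can express "lower-level question" directly.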

Testing AI Against Human Experts

The researchers tested four models—GPT, Claude Opus, Gemini, and GLM—using 1,200 real student responses from a second-year Computer Engineering course. Each response was graded independently by three human instructors to create a reliable baseline. The AI models were tested using two different prompting strategies: a minimal baseline and a rubric-enhanced version that provided specific guidelines for scoring.
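
The difference between the two prompting strategies can be sketched as a single prompt builder with an optional rubric. This is an illustration only: the paper's actual prompt templates are not reproduced in this summary, so the wording, the function name `build_prompt`, and the `max_score` scale are all assumptions.

```python
from typing import Optional, Sequence


def build_prompt(
    question: str,
    response: str,
    rubric: Optional[Sequence[str]] = None,
    max_score: float = 1.0,  # assumed scale; the summary does not state one
) -> str:
    """Build a grading prompt (hypothetical wording, not the paper's).

    With rubric=None this is the minimal baseline variant; passing
    rubric criteria yields the rubric-enhanced variant.
    """
    lines = [
        "You are grading a short Linux/bash exam answer.",
        f"Question: {question}",
        f"Student response: {response}",
    ]
    if rubric:
        lines.append("Rubric:")
        lines.extend(f"- {criterion}" for criterion in rubric)
    lines.append(f"Reply with a single score between 0 and {max_score}.")
    return "\n".join(lines)


baseline = build_prompt("List all files, including hidden ones.", "ls -a")
enhanced = build_prompt(
    "List all files, including hidden ones.",
    "ls -a",
    rubric=[
        "Accept equivalent commands that produce the same listing.",
        "Give partial credit if hidden files are omitted.",
    ],
)
```

Keeping both variants in one function means the only thing that changes between conditions is the rubric text, which is the comparison the study is making.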

Key Findings

The study found that Gemini 3.0 Pro, when paired with a rubric-guided prompt, achieved the highest agreement with human instructors (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014). Across all models, rubric quality had a larger effect on agreement than the choice of provider, and structured prompts consistently improved agreement. Agreement nevertheless declined consistently as task complexity increased from Level 1 to Level 4, with the largest discrepancies at the higher levels.
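
The three agreement statistics reported in the abstract (ICC(3,1), MAE, and Bland-Altman bias) can be computed as in the sketch below. It uses the standard textbook definitions, not the authors' code. Two assumptions are made here: each response has one human reference score paired with one AI score, and the bias is taken as AI minus human. The summary does not state how the three instructors' grades were combined or which sign convention was used.

```python
from statistics import mean


def mae(human, ai):
    """Mean absolute error between paired human and AI scores."""
    return mean(abs(a - h) for h, a in zip(human, ai))


def bland_altman_bias(human, ai):
    """Mean signed difference (AI minus human; sign convention assumed)."""
    return mean(a - h for h, a in zip(human, ai))


def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    `ratings` is a list of rows, one row per response, with one score
    per rater in each row (for example [human_score, ai_score]).
    """
    n = len(ratings)       # number of responses
    k = len(ratings[0])    # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


# Toy data (not from the study): the AI is 0.1 higher on every response.
human = [0.0, 0.5, 1.0, 1.0]
ai = [0.1, 0.6, 1.1, 1.1]
pairs = [[h, a] for h, a in zip(human, ai)]
```

On this toy data the AI scores differ from the human scores by a constant offset, so the consistency-type ICC is essentially 1 even though MAE and bias are both about 0.1. This is why the three statistics are reported together: ICC(3,1) alone would not reveal a systematic shift.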

Practical Implications

The results suggest that LLMs grade simpler, lower-level command-line tasks more reliably than advanced, high-complexity operations. AI can therefore help scale grading, but it is best used within a hybrid workflow. With the taxonomy-based approach, educators can identify which questions are suitable for AI-assisted grading and which require human review, so that feedback stays efficient without sacrificing accuracy or fairness on the harder questions.
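
One way to picture such a hybrid workflow is a simple router that sends each question to AI-assisted grading or to human review according to its taxonomy level. The function name `route_questions` and the default cutoff `max_auto_level=2` are placeholders chosen for illustration. This summary does not report a specific level at which human review becomes necessary.

```python
def route_questions(question_levels, max_auto_level=2):
    """Split question IDs into AI-graded and human-reviewed groups.

    `question_levels` maps a question ID to its taxonomy level (1 to 4).
    `max_auto_level` is an assumed cutoff, not a value from the paper.
    """
    ai_graded, human_review = [], []
    for question_id, level in question_levels.items():
        if not 1 <= level <= 4:
            raise ValueError(f"{question_id}: level must be 1-4, got {level}")
        if level <= max_auto_level:
            ai_graded.append(question_id)
        else:
            human_review.append(question_id)
    return ai_graded, human_review


# Hypothetical exam with one question per level.
exam = {"q1": 1, "q2": 3, "q3": 2, "q4": 4}
ai_graded, human_review = route_questions(exam)
```

In practice the cutoff would be chosen from measured agreement at each level on a course's own data, not fixed in advance.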
