Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

Key Takeaways

  • This research addresses the challenge of moving from traditional, marks-based grading to qualitative Competency-Based Education (CBE) by benchmarking heterogeneous LLMs for automated competency assessment in secondary-level mathematics.
  • As Competency-Based Education (CBE) gains traction worldwide, the shift from marks-based assessment to qualitative competency mapping places a heavy manual burden on educators.
  • The paper tackles this bottleneck with a "Human-in-the-Loop" benchmarking framework that assesses how effectively multiple LLMs can automate secondary-level mathematics assessment.
  • The findings reveal a marked "architecture-compatibility gap": adherence to instruction constraints matters more than raw parameter count in rubric-constrained grading tasks.
  • We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a "Human-in-the-Loop" framework.
Paper Abstract

As Competency-Based Education (CBE) gains traction around the world, the shift from marks-based assessment to qualitative competency mapping is a manual challenge for educators. This paper tackles this bottleneck by proposing a "Human-in-the-Loop" benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric for four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisting of the open-weight models -- Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) -- and the proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652). The findings show a marked "architecture-compatibility gap": although the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved "Fair Agreement" (kappa_w ~ 0.38), the larger Orion (70B) model exhibited "No Agreement" (kappa_w = -0.0261), suggesting that architectural compliance with instruction constraints outweighs raw parameter scale in rubric-constrained tasks. We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a "Human-in-the-Loop" framework.

Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
This research addresses the challenge of moving from traditional, mark-based grading to qualitative "Competency-Based Education" (CBE). Because mapping student performance to specific competencies is labor-intensive for teachers, the authors propose a "Human-in-the-Loop" framework that uses Large Language Models (LLMs) to assist in the assessment process. By testing various AI models against a ground truth established by senior mathematics faculty, the study explores whether AI can reliably evaluate student reasoning in secondary-level mathematics.

A New Framework for Mathematical Assessment

The researchers developed a competency-based assessment framework tailored to the Grade 10 Optional Mathematics curriculum in Nepal. Instead of focusing solely on whether a final answer is correct, the framework evaluates four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. Each competency is measured across four levels of proficiency—Awareness, Application, Mastery, and Influence—using a rubric designed to capture a student's reasoning process, partial solutions, and strategic thinking.
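The 4-by-4 grid of competencies and proficiency levels lends itself to a simple data model. The Python sketch below shows one way a single rubric judgment could be represented and validated; the class and field names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# Illustrative sketch of the 4x4 rubric described above; all names here are
# assumptions, not the authors' actual data model.
COMPETENCIES = [
    "Comprehension", "Knowledge", "Operational Fluency", "Behavior and Correlation",
]
PROFICIENCY_LEVELS = ["Awareness", "Application", "Mastery", "Influence"]

@dataclass
class CompetencyScore:
    competency: str  # one of COMPETENCIES
    level: str       # one of PROFICIENCY_LEVELS
    evidence: str    # excerpt from the student's working that justifies the level

def validate(score: CompetencyScore) -> None:
    """Reject any score that falls outside the rubric."""
    if score.competency not in COMPETENCIES:
        raise ValueError(f"Unknown competency: {score.competency}")
    if score.level not in PROFICIENCY_LEVELS:
        raise ValueError(f"Unknown proficiency level: {score.level}")
```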

Testing AI Against Human Experts

To validate the system, the team collected handwritten math scripts from 33 students. Two senior mathematics faculty members graded these scripts independently to create a "ground truth" baseline, achieving high agreement (kappa_w = 0.8652). The researchers then tasked an ensemble of four LLMs—Eagle (Llama 3.1-8B), Orion (Llama 3.3-70B), Nova (Gemini 2.5 Flash), and Lyra (Gemini 3 Pro)—with evaluating the same scripts using the established rubric. All models were run at a low temperature of 0.1 to keep their outputs consistent.
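Agreement figures such as kappa_w = 0.8652 are weighted Cohen's kappa scores computed over ordinal labels. The snippet below shows how such a score can be obtained with scikit-learn; the example labels and the quadratic weighting scheme are assumptions, since the summary does not state which weighting the authors used.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal grades (0 = Awareness ... 3 = Influence) assigned to the
# same scripts by the human ground truth and by one LLM grader.
ground_truth = [3, 2, 2, 1, 0, 3, 2, 1]
llm_grades   = [3, 2, 1, 1, 0, 2, 2, 0]

# Weighted Cohen's kappa; quadratic weights are an assumption here.
kappa_w = cohen_kappa_score(ground_truth, llm_grades, weights="quadratic")
print(f"kappa_w = {kappa_w:.4f}")
```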

The Architecture-Compatibility Gap

The study revealed a surprising "Architecture-compatibility gap." While the Gemini-based Mixture-of-Experts (MoE) models achieved "Fair Agreement" with the human experts (kappa_w ≈ 0.38), the much larger Orion 70B model showed "No Agreement" (kappa_w = -0.0261). This suggests that for rubric-constrained tasks, a model's ability to follow specific instructional constraints is more important than its total number of parameters. The findings indicate that while LLMs are not yet ready to handle autonomous certification, they are effective tools for preliminary evidence extraction and can provide high-value support to educators.
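The qualitative labels quoted above ("Fair Agreement", "No Agreement") correspond to conventional kappa bands. A minimal helper along the lines of the Landis-and-Koch scale reproduces the labels reported in the study; the exact cut-offs the authors used are an assumption here.

```python
def agreement_band(kappa: float) -> str:
    """Map a weighted kappa to a qualitative agreement band.
    Thresholds follow the common Landis-and-Koch convention; the paper's
    exact cut-offs are an assumption."""
    if kappa <= 0.00:
        return "No Agreement"
    if kappa <= 0.20:
        return "Slight Agreement"
    if kappa <= 0.40:
        return "Fair Agreement"
    if kappa <= 0.60:
        return "Moderate Agreement"
    if kappa <= 0.80:
        return "Substantial Agreement"
    return "Almost Perfect Agreement"

print(agreement_band(0.8652))   # human ground truth: Almost Perfect Agreement
print(agreement_band(0.38))     # Gemini-based MoE models: Fair Agreement
print(agreement_band(-0.0261))  # Orion (70B): No Agreement
```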

Limitations and Future Outlook

The authors emphasize that current AI systems still face a "black box" problem, where the underlying logic for a grade is not always transparent, which can hinder teacher trust. Additionally, the study highlights that LLMs can struggle with long contexts, sometimes leading to hallucinations or incorrect grading when processing large amounts of information. The researchers conclude that the future of automated assessment lies in a collaborative approach where AI handles the heavy lifting of evidence gathering, while human experts remain the final authority in the evaluation process.
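As a rough illustration of that division of labor, the sketch below places an LLM evidence-extraction step inside a workflow where the teacher makes the final call. The function names and flow are assumptions used for illustration, not the authors' pipeline.

```python
from typing import Callable, Dict

# Hypothetical human-in-the-loop review step: the model drafts evidence and
# provisional levels, a human expert confirms or overrides them.

def llm_extract_evidence(script_text: str) -> Dict:
    """Placeholder for an LLM call that returns provisional competency levels
    plus the excerpts of student work it relied on (assumed interface)."""
    raise NotImplementedError

def review_script(script_text: str, teacher_review: Callable[[Dict], Dict]) -> Dict:
    draft = llm_extract_evidence(script_text)  # AI handles evidence gathering
    final = teacher_review(draft)              # human expert remains the final authority
    final["certified_by"] = "human"            # certification is never autonomous
    return final
```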
