Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

Key Takeaways

  • This research addresses the challenge of moving from traditional, marks-based grading to qualitative Competency-Based Education (CBE) by benchmarking heterogeneous LLMs for automated competency assessment in secondary-level mathematics.
  • As Competency-Based Education (CBE) gains traction worldwide, the shift from marks-based assessment to qualitative competency mapping places a heavy manual burden on educators.
  • The paper tackles this bottleneck with a "Human-in-the-Loop" benchmarking framework that assesses how effectively multiple LLMs can automate secondary-level mathematics assessment.
  • The findings reveal a marked "architecture-compatibility gap": adherence to instruction constraints matters more than raw parameter count in rubric-constrained grading tasks.
  • We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a "Human-in-the-Loop" framework.
Paper Abstract

As Competency-Based Education (CBE) gains traction around the world, the shift from marks-based assessment to qualitative competency mapping is a manual challenge for educators. This paper tackles this bottleneck by proposing a "Human-in-the-Loop" benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric for four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisting of the open-weight models -- Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) -- and the proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652). The findings show a marked "architecture-compatibility gap": although the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved "Fair Agreement" (kappa_w ~ 0.38), the larger Orion (70B) model exhibited "No Agreement" (kappa_w = -0.0261), suggesting that architectural compliance with instruction constraints outweighs raw parameter scale in rubric-constrained tasks. We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a "Human-in-the-Loop" framework.

Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
This research addresses the challenge of moving from traditional, mark-based grading to qualitative "Competency-Based Education" (CBE). Because mapping student performance to specific competencies is labor-intensive for teachers, the authors propose a "Human-in-the-Loop" framework that uses Large Language Models (LLMs) to assist in the assessment process. By testing various AI models against a ground truth established by senior mathematics faculty, the study explores whether AI can reliably evaluate student reasoning in secondary-level mathematics.

A New Framework for Mathematical Assessment

The researchers developed a competency-based assessment framework tailored to the Grade 10 Optional Mathematics curriculum in Nepal. Instead of focusing solely on whether a final answer is correct, the framework evaluates four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. Each competency is measured across four levels of proficiency—Awareness, Application, Mastery, and Influence—using a rubric designed to capture a student's reasoning process, partial solutions, and strategic thinking.
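The 4-by-4 grid of competencies and proficiency levels lends itself to a simple data model. The Python sketch below shows one way a single rubric judgment could be represented and validated; the class and field names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# Illustrative sketch of the 4x4 rubric described above; all names here are
# assumptions, not the authors' actual data model.
COMPETENCIES = [
    "Comprehension", "Knowledge", "Operational Fluency", "Behavior and Correlation",
]
PROFICIENCY_LEVELS = ["Awareness", "Application", "Mastery", "Influence"]

@dataclass
class CompetencyScore:
    competency: str  # one of COMPETENCIES
    level: str       # one of PROFICIENCY_LEVELS
    evidence: str    # excerpt from the student's working that justifies the level

def validate(score: CompetencyScore) -> None:
    """Reject any score that falls outside the rubric."""
    if score.competency not in COMPETENCIES:
        raise ValueError(f"Unknown competency: {score.competency}")
    if score.level not in PROFICIENCY_LEVELS:
        raise ValueError(f"Unknown proficiency level: {score.level}")
```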

Testing AI Against Human Experts

To validate the system, the team collected handwritten math scripts from 33 students. Two senior mathematics faculty members graded these scripts independently to create a "ground truth" baseline, achieving high agreement (kappa_w = 0.8652). The researchers then tasked an ensemble of four LLMs—Eagle (Llama 3.1-8B), Orion (Llama 3.3-70B), Nova (Gemini 2.5 Flash), and Lyra (Gemini 3 Pro)—with evaluating the same scripts using the established rubric. All models were run at a low temperature of 0.1 to keep their outputs consistent.
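Agreement figures such as kappa_w = 0.8652 are weighted Cohen's kappa scores computed over ordinal labels. The snippet below shows how such a score can be obtained with scikit-learn; the example labels and the quadratic weighting scheme are assumptions, since the summary does not state which weighting the authors used.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal grades (0 = Awareness ... 3 = Influence) assigned to the
# same scripts by the human ground truth and by one LLM grader.
ground_truth = [3, 2, 2, 1, 0, 3, 2, 1]
llm_grades   = [3, 2, 1, 1, 0, 2, 2, 0]

# Weighted Cohen's kappa; quadratic weights are an assumption here.
kappa_w = cohen_kappa_score(ground_truth, llm_grades, weights="quadratic")
print(f"kappa_w = {kappa_w:.4f}")
```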

The Architecture-Compatibility Gap

The study revealed a surprising "Architecture-compatibility gap." While the Gemini-based Mixture-of-Experts (MoE) models achieved "Fair Agreement" with the human experts (kappa_w ≈ 0.38), the much larger Orion 70B model showed "No Agreement" (kappa_w = -0.0261). This suggests that for rubric-constrained tasks, a model's ability to follow specific instructional constraints is more important than its total number of parameters. The findings indicate that while LLMs are not yet ready to handle autonomous certification, they are effective tools for preliminary evidence extraction and can provide high-value support to educators.
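The qualitative labels quoted above ("Fair Agreement", "No Agreement") correspond to conventional kappa bands. A minimal helper along the lines of the Landis-and-Koch scale reproduces the labels reported in the study; the exact cut-offs the authors used are an assumption here.

```python
def agreement_band(kappa: float) -> str:
    """Map a weighted kappa to a qualitative agreement band.
    Thresholds follow the common Landis-and-Koch convention; the paper's
    exact cut-offs are an assumption."""
    if kappa <= 0.00:
        return "No Agreement"
    if kappa <= 0.20:
        return "Slight Agreement"
    if kappa <= 0.40:
        return "Fair Agreement"
    if kappa <= 0.60:
        return "Moderate Agreement"
    if kappa <= 0.80:
        return "Substantial Agreement"
    return "Almost Perfect Agreement"

print(agreement_band(0.8652))   # human ground truth: Almost Perfect Agreement
print(agreement_band(0.38))     # Gemini-based MoE models: Fair Agreement
print(agreement_band(-0.0261))  # Orion (70B): No Agreement
```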

Limitations and Future Outlook

The authors emphasize that current AI systems still face a "black box" problem, where the underlying logic for a grade is not always transparent, which can hinder teacher trust. Additionally, the study highlights that LLMs can struggle with long contexts, sometimes leading to hallucinations or incorrect grading when processing large amounts of information. The researchers conclude that the future of automated assessment lies in a collaborative approach where AI handles the heavy lifting of evidence gathering, while human experts remain the final authority in the evaluation process.
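As a rough illustration of that division of labor, the sketch below places an LLM evidence-extraction step inside a workflow where the teacher makes the final call. The function names and flow are assumptions used for illustration, not the authors' pipeline.

```python
from typing import Callable, Dict

# Hypothetical human-in-the-loop review step: the model drafts evidence and
# provisional levels, a human expert confirms or overrides them.

def llm_extract_evidence(script_text: str) -> Dict:
    """Placeholder for an LLM call that returns provisional competency levels
    plus the excerpts of student work it relied on (assumed interface)."""
    raise NotImplementedError

def review_script(script_text: str, teacher_review: Callable[[Dict], Dict]) -> Dict:
    draft = llm_extract_evidence(script_text)  # AI handles evidence gathering
    final = teacher_review(draft)              # human expert remains the final authority
    final["certified_by"] = "human"            # certification is never autonomous
    return final
```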
