Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
This study investigates how well Large Language Models (LLMs) perform as intelligent tutoring agents for propositional logic. Effective tutoring requires distinguishing among three types of student steps: optimal steps, valid but suboptimal steps, and incorrect steps. LLMs are increasingly used to provide feedback, so the study evaluates whether they can accurately diagnose these categories and give helpful, pedagogically sound guidance.
Evaluating Diagnostic Precision
The researchers built a benchmark of 10,836 solution-feedback pairs to test seven LLMs. They used a knowledge graph (KG) derived from an existing tutoring system to establish a "ground truth" for every possible step in a logic proof, which let the team check whether the models categorized student work correctly. The study tested three feedback conditions: a "Peer" (who sees only the student's answer), a "Teacher" (who sees the full context of the problem), and a "Judge" (who reviews the feedback provided by others).
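To make the idea of KG-derived ground truth concrete, here is a minimal sketch of how a step might be labeled. It assumes the knowledge graph can be treated as a directed graph of proof states and that "optimal" means the step lies on a shortest path to the goal. The summary does not specify the study's actual representation, so `TOY_KG`, `distance_to_goal`, and `label_step` are all hypothetical names used only for illustration.

```python
from collections import deque

# Hypothetical toy knowledge graph: proof state -> states reachable by one valid step.
TOY_KG = {
    "start": ["a", "b"],
    "a": ["goal"],
    "b": ["c"],
    "c": ["goal"],
    "goal": [],
}


def distance_to_goal(kg, state, goal):
    """Breadth-first search: number of valid steps from state to goal, or None."""
    seen = {state}
    queue = deque([(state, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in kg.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def label_step(kg, state, next_state, goal):
    """Label a student step as 'optimal', 'valid-alternative', or 'incorrect'.

    Assumption: a step is incorrect if the graph has no edge for it, and
    optimal if it reduces the shortest distance to the goal by exactly one.
    """
    if next_state not in kg.get(state, []):
        return "incorrect"
    best = distance_to_goal(kg, state, goal)
    after = distance_to_goal(kg, next_state, goal)
    if best is not None and after is not None and after + 1 == best:
        return "optimal"
    return "valid-alternative"
```

In this toy graph, moving from `start` to `a` is labeled optimal, moving to `b` is a valid alternative (it still reaches the goal, but by a longer route), and jumping straight to `c` is incorrect because no single valid step connects them.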
Key Findings on Model Performance
The results show that while LLMs are excellent at identifying optimal steps, they struggle in the two cases where feedback matters most. The models exhibited two major, systematic failures: they frequently over-rejected valid but suboptimal reasoning and over-validated incorrect solutions. These errors occurred regardless of the model used or the specific context of the problem, suggesting that the issue is an architectural limitation of current LLMs rather than a lack of information. In short, the models were good at confirming what was already correct but failed to provide the nuanced guidance needed when a student took a "valid-alternative" path or made a mistake.
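The two failure modes can be expressed as simple rates over (ground truth, model label) pairs. The sketch below is one plausible way to compute them, not the study's own metric definitions: it assumes over-rejection means a valid-alternative step labeled incorrect, and over-validation means an incorrect step labeled anything other than incorrect. The function name `error_rates` and the data in `TOY_PAIRS` are invented for illustration and are not results from the paper.

```python
def error_rates(pairs):
    """Compute over-rejection and over-validation rates.

    pairs: list of (ground_truth, model_label) tuples, where each label is
    'optimal', 'valid-alternative', or 'incorrect'.
    Returns None for a rate when its category has no examples.
    """
    # Model labels for steps that were truly valid alternatives.
    valid_alt = [pred for truth, pred in pairs if truth == "valid-alternative"]
    # Model labels for steps that were truly incorrect.
    incorrect = [pred for truth, pred in pairs if truth == "incorrect"]

    over_rejection = (
        sum(pred == "incorrect" for pred in valid_alt) / len(valid_alt)
        if valid_alt
        else None
    )
    over_validation = (
        sum(pred != "incorrect" for pred in incorrect) / len(incorrect)
        if incorrect
        else None
    )
    return {"over_rejection": over_rejection, "over_validation": over_validation}


# Made-up example data, for illustration only.
TOY_PAIRS = [
    ("optimal", "optimal"),
    ("valid-alternative", "incorrect"),
    ("valid-alternative", "valid-alternative"),
    ("incorrect", "optimal"),
    ("incorrect", "incorrect"),
    ("incorrect", "incorrect"),
    ("incorrect", "valid-alternative"),
]

rates = error_rates(TOY_PAIRS)
```

Reporting the two rates separately keeps the failure on valid-alternative steps from being hidden by high accuracy on optimal steps.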
The Gap Between Diagnosis and Instruction
A critical finding of this study is that even when an LLM correctly identifies a student's error, it does not necessarily provide helpful feedback. The researchers found a disconnect between diagnostic judgment and instructional effectiveness. A model might correctly label a solution as incorrect but still fail to offer actionable advice, instead giving vague responses or accidentally revealing the answer. Accurate diagnosis is therefore only one part of the tutoring process; the ability to scaffold learning, that is, to guide a student without simply giving them the solution, remains a significant hurdle.
Implications for Future Tutoring Systems
The findings suggest that current LLMs are not yet ready to replace traditional intelligent tutoring systems. Instead, the authors propose a hybrid architecture. In this design, a knowledge-graph-grounded system would handle the technical diagnosis of whether a student's step is valid or incorrect, while the LLM would be used for its strengths: facilitating open-ended dialogue and providing conversational scaffolding. This approach would combine the precision of structured logic systems with the natural, conversational benefits of an LLM.
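The division of labor in such a hybrid system might look like the following sketch. It assumes the diagnosis arrives from a KG-based component as one of three labels and that the LLM is reachable through some text-in, text-out callable. `hybrid_feedback`, `GUIDANCE`, and `echo_model` are hypothetical names; `echo_model` is a stand-in that simply returns its prompt and does not represent any real LLM API.

```python
from typing import Callable

# Hypothetical per-diagnosis instructions that constrain what the LLM may say.
GUIDANCE = {
    "optimal": "Confirm the step and encourage the student to continue.",
    "valid-alternative": (
        "Acknowledge that the step is valid, then hint that a shorter route exists."
    ),
    "incorrect": (
        "Point to the rule that was misapplied without stating the correct step."
    ),
}


def hybrid_feedback(
    student_step: str, diagnosis: str, generate: Callable[[str], str]
) -> str:
    """Build a prompt around a fixed KG diagnosis and pass it to the LLM.

    The diagnosis is decided upstream (assumed to come from the knowledge
    graph); the LLM only phrases the feedback.
    """
    if diagnosis not in GUIDANCE:
        raise ValueError(f"unknown diagnosis: {diagnosis!r}")
    prompt = (
        f"Diagnosis (from the knowledge graph, do not override): {diagnosis}\n"
        f"Student step: {student_step}\n"
        f"Instruction: {GUIDANCE[diagnosis]}"
    )
    return generate(prompt)


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM call: returns the prompt unchanged."""
    return prompt


example = hybrid_feedback("From p -> q and p, infer q", "optimal", echo_model)
```

Passing the diagnosis in as a fixed input, rather than asking the LLM to produce it, reflects the proposal's core idea: the structured system decides what is true about the step, and the LLM decides only how to say it.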