Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
This paper addresses a critical bottleneck in how we measure the mathematical reasoning ability of Large Language Models (LLMs). Most current benchmarks rely on symbolic verification: tools that check whether a model's answer matches the ground truth by treating both as rigid mathematical strings. The authors argue that this method is fundamentally flawed because it often marks correct answers as "wrong" simply because they are formatted differently, carry units, or express the same value in an alternative notation. To solve this, the researchers propose an "LLM-as-a-judge" framework that uses the semantic understanding of a second, stronger LLM to decide whether a model's answer is mathematically correct, regardless of its surface format.
The Problem with Symbolic Rigidity
Standard evaluation frameworks, such as Lighteval and SimpleRL, rely on tools like SymPy to compare model outputs against a reference answer. This approach is brittle: it fails when a model answers "1000 rad/s" instead of the expected "1000," or when it uses different variable names or derivative notations. Because these systems demand an exact symbolic match, they frequently undercount correct responses, which distorts performance monitoring. The problem is especially acute for Reinforcement Learning with Verifiable Rewards (RLVR), where the training process itself relies on these flawed automated rewards to teach the model.
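A minimal sketch using SymPy directly illustrates the failure mode (this is an illustration, not the exact verifier code used in Lighteval or SimpleRL): a correct answer that carries units fails to parse and is rejected outright.

```python
# Illustration of symbolic rigidity: "1000 rad/s" does not parse as a
# bare expression, so a strict symbolic comparison against the
# reference "1000" marks it wrong even though the value is correct.
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def symbolic_match(prediction: str, reference: str) -> bool:
    """Return True only if both strings parse and are symbolically equal."""
    try:
        pred = parse_expr(prediction)
        ref = parse_expr(reference)
    except Exception:
        return False  # unparseable answers are simply marked wrong
    return simplify(pred - ref) == 0

print(symbolic_match("1000", "1000"))        # True
print(symbolic_match("1000 rad/s", "1000"))  # False: the units break parsing
print(symbolic_match("0.5", "1/2"))          # True: SymPy handles this case
```

SymPy catches some equivalences, such as the decimal-versus-fraction case above, but anything outside its parser's reach is silently scored as incorrect.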
A More Flexible Evaluation Pipeline
The proposed framework replaces rigid matching with a multi-stage, LLM-based verification process. First, the system performs independent question answering, where a strong judge model solves the problem without seeing the ground truth, reducing bias toward potentially incorrect dataset labels. Second, it validates the dataset’s own ground truth by comparing it against the judge’s independent solution. Finally, the framework evaluates the model’s predictions by analyzing them for semantic correctness. To ensure reliability, the team uses majority voting across multiple assessments and randomizes the order of responses to prevent "positional bias," where a judge might unfairly favor an answer based on its placement in a list.
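The skeleton below sketches this three-stage pipeline under stated assumptions: `ask_judge` is a hypothetical stand-in for a call to the judge LLM's API, and the prompt wording is illustrative rather than the paper's actual prompts.

```python
# A minimal sketch of the three-stage judging pipeline described above.
import random
from collections import Counter

def ask_judge(prompt: str) -> str:
    """Hypothetical helper: send a prompt to the judge model, return its reply."""
    raise NotImplementedError("wire this to your judge model's API")

def evaluate(question: str, ground_truth: str, prediction: str,
             n_votes: int = 5) -> str:
    # Stage 1: the judge solves the problem without seeing the label,
    # which reduces bias toward a potentially incorrect dataset answer.
    independent = ask_judge(f"Solve this problem:\n{question}")

    # Stage 2: validate the dataset's ground truth against the judge's
    # independent solution; if they conflict, abstain rather than force
    # a verdict (the paper's "lacking evaluation applicability").
    label_ok = ask_judge(
        f"Question: {question}\nAnswer A: {ground_truth}\n"
        f"Answer B: {independent}\n"
        "Are A and B mathematically equivalent? Reply yes or no."
    )
    if label_ok.strip().lower().startswith("no"):
        return "lacking evaluation applicability"

    # Stage 3: judge the prediction several times, shuffling which answer
    # appears first to counter positional bias, then take a majority vote.
    votes = []
    for _ in range(n_votes):
        pair = [("Answer A", prediction), ("Answer B", ground_truth)]
        random.shuffle(pair)
        listing = "\n".join(f"{name}: {text}" for name, text in pair)
        verdict = ask_judge(
            f"Question: {question}\n{listing}\n"
            "Are the two answers mathematically equivalent? Reply yes or no."
        )
        votes.append(verdict.strip().lower().startswith("yes"))
    return "correct" if Counter(votes)[True] > n_votes / 2 else "incorrect"
```

Randomizing the answer order on every vote means any positional preference the judge has averages out across the majority vote rather than systematically skewing the verdict.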
Significant Gains in Accuracy
The researchers tested their framework across several benchmarks, including GSM8K, Minerva, Math500, and Olympiad-level datasets. The results show a consistent and significant improvement in evaluation accuracy compared to traditional symbolic methods. For instance, on the Minerva dataset, the framework identified many more correct answers that symbolic tools had previously discarded, leading to higher performance scores. By meta-evaluating the system against a manually labeled set of 640 model responses, the authors confirmed that their approach is not only more robust but also more capable of handling the diverse ways humans and models express mathematical truths.
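As a rough illustration of this kind of meta-evaluation (the helper and toy data below are hypothetical, not the paper's actual labels), a verifier can be scored by its agreement rate with human annotations:

```python
# Hypothetical meta-evaluation: score an automated verifier by how often
# its verdicts agree with human labels on a hand-annotated response set.
def agreement(human_labels: list[bool], verifier_verdicts: list[bool]) -> float:
    """Fraction of responses where the verifier matches the human label."""
    assert len(human_labels) == len(verifier_verdicts)
    matches = sum(h == v for h, v in zip(human_labels, verifier_verdicts))
    return matches / len(human_labels)

# Toy example (the paper's set has 640 responses): 3 of 4 agree -> 0.75
print(agreement([True, False, True, True], [True, False, False, True]))
```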
Considerations for Implementation
While the LLM-as-a-judge approach is more flexible, the authors acknowledge that LLMs carry inherent biases of their own, such as favoring certain response positions or being swayed by the ground truth provided in the prompt. To mitigate these biases, the framework incorporates specific design choices, including independent answering and majority voting. Furthermore, in cases where a question is ambiguous or the ground truth is fundamentally incorrect, the system marks the case as "lacking evaluation applicability" rather than forcing a potentially wrong judgment, prioritizing the reliability of the evaluation over scoring every single question.
