PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

Key Takeaways

  • We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities.
  • Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions.
  • Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.
  • PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models aims to solve a critical problem in artificial intelligence: why do large language models (LLMs) fail at math?

Paper Abstract

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs' numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.

PyraMathBench starts from a basic question: why do large language models (LLMs) fail at math? Many benchmarks test whether a model gets the right answer, but they rarely explain why it got the wrong one. This paper introduces a hierarchical benchmark that breaks complex math problems into smaller, foundational skills, so researchers can pinpoint where a model's reasoning breaks down: in basic calculation, in understanding the question, or in parsing numerical data.

A Hierarchical Approach to Math

Current benchmarks often treat math problems as a single, monolithic task. PyraMathBench (PMB) takes a different approach by organizing 32,505 questions into a "pyramid" structure derived from 7,404 real-world math word problems. It categorizes these into four cognitive aspects: complex reasoning, understanding, calculation, and numerical parsing. By decomposing problems into 14 distinct subtasks, the benchmark can isolate whether a model’s failure is due to a lack of logical reasoning or simply an inability to handle basic numerical inputs.
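
To make the decomposition concrete, here is a minimal sketch of scoring per-question results by cognitive aspect. Only the four aspect names come from the article; the `Result` record, its fields, the sample subtask labels, and both helper functions are hypothetical illustrations, not the benchmark's actual data format or scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four cognitive aspects named in the article.
ASPECTS = ("complex reasoning", "understanding", "calculation", "numerical parsing")


@dataclass(frozen=True)
class Result:
    """One graded benchmark question (hypothetical record layout)."""
    source_problem: str  # id of the word problem the question was derived from
    aspect: str          # one of ASPECTS
    subtask: str         # illustrative subtask label, not the paper's naming
    correct: bool


def accuracy_by_aspect(results):
    """Return {aspect: accuracy} for every aspect that has at least one result."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        if r.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {r.aspect!r}")
        totals[r.aspect] += 1
        hits[r.aspect] += int(r.correct)
    return {a: hits[a] / totals[a] for a in ASPECTS if totals[a]}


def weakest_aspect(results):
    """Name the aspect with the lowest accuracy, i.e. the likeliest failure point."""
    scores = accuracy_by_aspect(results)
    return min(scores, key=scores.get)


if __name__ == "__main__":
    demo = [
        Result("p1", "complex reasoning", "multi-step", True),
        Result("p1", "calculation", "arithmetic", False),
        Result("p1", "calculation", "arithmetic", False),
        Result("p1", "numerical parsing", "digit reading", True),
        Result("p2", "calculation", "arithmetic", True),
        Result("p2", "understanding", "question intent", True),
    ]
    print(accuracy_by_aspect(demo))
    print(weakest_aspect(demo))
```

Because every derived question keeps a link to its source problem, a wrong answer on the full problem can be traced to whichever lower-level aspect also scores poorly.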

Identifying Model Weaknesses

The researchers evaluated 11 state-of-the-art LLMs using this new framework. The results revealed that even high-performing models struggle with abstraction and factual retrieval. A particularly striking finding is that multimodal models (those that can "see" images) perform poorly on visual numerical tasks. Even when these models recognize digits in an image, they often fail to identify which numbers are actually relevant to solving the problem, leading to "hallucinations" where the model gets distracted by redundant information.

Improving Performance with SOLVE and IRPO

To address these issues, the authors propose two components: the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO). Many models use external tools (like calculators) to help with math, but they often call these tools unnecessarily or fail to format their requests correctly. SOLVE acts as a filter that assesses the difficulty of a question and bypasses tool calls when they aren't needed, which avoids the errors those calls would introduce. According to the abstract, efficient tool calling also relies on fuzzy matching and on rejecting low-quality calls. IRPO is a training framework that further refines how the model interacts with these tools. Together, these methods helped the Qwen-2.5 model achieve a 5.0 score improvement, suggesting that better tool management is an important lever for mathematical accuracy.
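
The two tool-call ideas named in the abstract, fuzzy matching and low-quality call rejection, can be sketched in a few lines. This is not the authors' SOLVE implementation: the tool registry, the function names, the 0.8 similarity cutoff, and the argument checks are all assumptions chosen for illustration, and the difficulty assessment step is not modeled here.

```python
import difflib

# Hypothetical tool registry; the paper's actual tool set is not described here.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}


def resolve_tool(name, cutoff=0.8):
    """Fuzzy-match a possibly misspelled tool name against the registry.

    Returns the registered name, or None when nothing is similar enough.
    The cutoff value is an assumption, not a figure from the paper.
    """
    matches = difflib.get_close_matches(
        name.strip().lower(), list(TOOLS), n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def is_number(value):
    """True for ints and floats, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def run_tool_call(name, args):
    """Execute a tool call, or return None to signal a rejected call."""
    tool = resolve_tool(name)
    if tool is None:
        return None  # unknown tool: reject instead of guessing
    if len(args) != 2 or not all(is_number(a) for a in args):
        return None  # malformed arguments: reject as a low-quality call
    try:
        return TOOLS[tool](*args)
    except ZeroDivisionError:
        return None  # the call cannot produce a usable result


if __name__ == "__main__":
    print(run_tool_call("mutiply", [6, 7]))  # misspelled name is still resolved
    print(run_tool_call("sqrt", [4]))        # no close match, so rejected
    print(run_tool_call("add", ["3", 4]))    # non-numeric argument, so rejected
```

Returning `None` for a rejected call lets the caller fall back to the model's own answer rather than propagating a broken tool result.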

Key Takeaways

The study highlights that mathematical ability in LLMs is not just about raw computing power, but about the synergy between understanding, logic, and tool usage. The authors note that excessive fine-tuning can sometimes hinder a model's general reasoning, and that current models still struggle significantly with distinguishing useful data from noise. By providing a granular way to measure these specific failures, PyraMathBench offers a roadmap for building more reliable and interpretable mathematical AI.
