VeriEvol: Scaling Multimodal Mathematical Reasoning...

Key Takeaways

VeriEvol is a framework designed to improve how AI models learn visual mathematical reasoning.
Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable.
Existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct.
The verified data that VeriEvol produces scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes.
We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.

Paper Abstract

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.

VeriEvol is a framework designed to improve how AI models learn visual mathematical reasoning. As models are trained on increasingly large datasets, the quality of the "answer keys" used for training becomes a critical bottleneck. If these labels are incorrect or unreliable, the model learns to repeat those mistakes. VeriEvol addresses this by treating data construction as a two-part challenge: making questions more difficult through systematic evolution and ensuring answers are verified through a rigorous, independent falsification process before they are ever used to train a model.

Evolving Question Difficulty

To move beyond simple questions that models can answer from text knowledge alone, VeriEvol uses a "type-aware" evolution module. Instead of applying a generic "make this harder" command to every image, the system first assigns each question to a family, such as geometry, charts, or OCR tasks. It then applies operators specific to that family to rewrite low-difficulty seeds into harder, image-grounded prompts. A strict filtering gate keeps only new questions that actually require the image to be solved, which prevents the model from relying on text-based shortcuts.
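
The routing and gating described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Seed` type, the three operator functions, the `ROUTES` table, and the idea of testing image dependence with a text-only ("blind") solver are all assumptions made for the example. A real operator would call a model rather than append a fixed string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Seed:
    question: str
    family: str  # hypothetical labels: "geometry", "chart", "ocr"


# Hypothetical route-specific operators (placeholders for model calls).
def evolve_geometry(question: str) -> str:
    return question + " Then find the area of the shaded region in the figure."


def evolve_chart(question: str) -> str:
    return question + " Compare it with the second-highest bar in the chart."


def evolve_ocr(question: str) -> str:
    return question + " Use the value printed on the label in the image."


ROUTES: Dict[str, Callable[[str], str]] = {
    "geometry": evolve_geometry,
    "chart": evolve_chart,
    "ocr": evolve_ocr,
}


def evolve(
    seed: Seed,
    candidate_answer: str,
    blind_solver: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a harder prompt, or None if it is unroutable or fails the gate."""
    operator = ROUTES.get(seed.family)
    if operator is None:
        return None  # no route for this family
    harder = operator(seed.question)
    # Assumed gate: if a solver that never sees the image already reaches
    # the candidate answer, the prompt is not image-grounded, so drop it.
    if blind_solver(harder) == candidate_answer:
        return None
    return harder
```

Under these assumptions, adding a new evolution route amounts to registering one more entry in `ROUTES`, which mirrors the claim that the framework extends by adding routes.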

The HTV-Agent Verification Process

The core innovation of the framework is the HTV-Agent, a "hypothesis-test" verifier. Rather than simply trusting an initial answer, the system treats every generated answer as a hypothesis to be falsified. It uses multiple independent solvers to generate candidate answers and then runs a series of "refutation" channels against them. These channels use code-based logic and visual analysis (such as checking bounding boxes or pixel-level data) to actively look for reasons why an answer might be wrong. An answer is accepted into the training data only if it survives these refutation attempts and passes a final, deterministic consensus check.
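
One way to picture the accept-only-if-not-refuted logic is the sketch below. It is an illustration under stated assumptions, not the HTV-Agent itself: the `Channel` signature, the `verify` function, the example `recompute_channel`, and the strict-majority rule used as the "deterministic consensus check" are all hypothetical choices.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Assumed interface: a refutation channel returns a reason string when it
# finds counter-evidence against an answer, and None when it finds none.
Channel = Callable[[str], Optional[str]]


def verify(hypotheses: Sequence[str], channels: Sequence[Channel]) -> Optional[str]:
    """Accept an answer only if no channel refutes it and a strict
    majority of all solver hypotheses agree on it (assumed rule)."""
    survivors = [
        answer
        for answer in hypotheses
        if not any(channel(answer) is not None for channel in channels)
    ]
    if not survivors:
        return None
    answer, votes = Counter(survivors).most_common(1)[0]
    # A strict majority over all hypotheses is unique, so the outcome
    # does not depend on tie-breaking order.
    return answer if 2 * votes > len(hypotheses) else None


# Hypothetical code-based channel: recompute a side length whose square is 144.
def recompute_channel(answer: str) -> Optional[str]:
    if answer.isdigit() and int(answer) ** 2 == 144:
        return None
    return "recomputed value does not match"


accepted = verify(["12", "12", "13"], [recompute_channel])
rejected = verify(["13", "13", "12"], [recompute_channel])
```

Here `accepted` is `"12"`, while `rejected` is `None`: the majority answer `"13"` is refuted, and the lone surviving `"12"` lacks consensus. Adding a verifier channel in this sketch means appending one more callable to the `channels` list.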

Scaling and Performance

The researchers found that this approach scales effectively. On a five-benchmark visual-math suite, scaling the evolved SFT data from 10K to 250K samples raised mean accuracy from 35.42 to 54.73. With the backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol then added a cumulative +3.88 over an un-evolved RL baseline. Of that gain, +1.82 was attributed to the evolved prompts and +2.06 to the HTV-Agent verifier.

Transparency and Traceability

A key feature of VeriEvol is its commitment to transparency. The authors have released the full "verifier trace" for every sample, which includes the original solver hypotheses, the counter-evidence reports, and the final decision-making rationale. By providing this level of detail, the researchers aim to allow other developers to audit the construction process, understand why specific data points were included or rejected, and extend the pipeline for future research rather than simply using the final model outputs.
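
To make the idea of auditing a trace concrete, here is a small sketch of what a per-sample record and a simple audit query might look like. The released trace format is not described in this summary, so the `VerifierTrace` fields, the JSON Lines encoding, and the helper names below are purely hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass
class VerifierTrace:
    # All field names are assumptions for illustration.
    sample_id: str
    solver_hypotheses: List[str]   # answers proposed by the solvers
    counter_evidence: List[str]    # reports from refutation channels
    accepted: bool                 # final decision
    rationale: str                 # why the decision was made


def to_jsonl(trace: VerifierTrace) -> str:
    """Serialize one trace as a single JSON line."""
    return json.dumps(asdict(trace))


def rejected_ids(lines: Iterable[str]) -> List[str]:
    """Audit helper: list the samples the verifier rejected."""
    ids = []
    for line in lines:
        record = json.loads(line)
        if not record["accepted"]:
            ids.append(record["sample_id"])
    return ids
```

With records in this shape, a downstream user could filter for rejected samples and read the attached counter-evidence to see why each one was excluded.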