When to Vote, When to Rewrite: Disagreement-Guided...

Key Takeaways

  • Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances.
  • Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at increased computational cost, yet often exhibit diminishing returns on hard problems.
  • Output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time.
  • The proposed training-free framework applies lightweight resolution to consistent cases, majority voting to moderate disagreement, and rewriting-based reformulation to highly ambiguous instances.
Paper Abstract

Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem, rather than allocating more computation within a single strategy, dynamically selecting among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% - 7% while reducing sampling cost compared to existing approaches.

Large Reasoning Models (LRMs) are powerful, but they often struggle with complex, challenging problems. While techniques like repeated sampling or tree search can help, they often waste computational resources by applying the same heavy-duty strategy to every problem, regardless of its difficulty. This paper introduces a training-free framework that treats test-time scaling as a routing problem, dynamically selecting the most efficient strategy based on how much the model disagrees with itself.

Identifying Difficulty Through Disagreement

The core insight of this research is that "output disagreement"—how often a model produces different answers for the same problem—is a reliable indicator of both problem difficulty and the likelihood of an incorrect prediction. When a model consistently produces the same answer, the problem is likely easy, and no extra work is needed. When the model produces conflicting answers, it signals that the problem is ambiguous or difficult, requiring a more sophisticated approach.
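One simple way to quantify this signal is the fraction of sampled answers that deviate from the most common answer. This is a minimal sketch of one plausible statistic; the paper describes disagreement qualitatively here, so the exact formulation is an assumption:

```python
from collections import Counter

def disagreement_rate(answers):
    """Fraction of sampled answers that differ from the most common one.

    0.0 means the model fully agrees with itself; values approaching 1.0
    mean the samples are spread across many distinct answers.
    """
    counts = Counter(answers)
    most_common_count = counts.most_common(1)[0][1]
    return 1.0 - most_common_count / len(answers)
```

For example, `disagreement_rate(["4", "4", "4", "4"])` returns `0.0` (an easy, consistent case), while `disagreement_rate(["4", "5", "6", "4"])` returns `0.5` (an ambiguous case that warrants more work).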

A Three-Stage Routing Strategy

Instead of blindly applying one method to every task, the framework routes instances through three stages based on the level of disagreement detected:

  • Disagreement Filter: The model performs two initial samplings. If the answers match, the problem is considered "easy," and the result is accepted immediately, saving computational power.

  • Vote Resolve: If there is minor disagreement, the model performs additional sampling. It then uses majority voting to select the most reliable answer from the combined pool of results.

  • Rewrite & Rethink: For instances with severe, persistent disagreement, the model reformulates the problem statement. By changing the surface expression of the question while keeping the underlying meaning, the model can often escape the incorrect reasoning path that led to the initial confusion.
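The three stages above can be sketched as a single routing function. Here `sample(problem)` draws one answer from the model and `rewrite(problem)` reformulates the problem statement; both are hypothetical stubs, and the sample counts and threshold are illustrative assumptions rather than the paper's reported settings:

```python
from collections import Counter

def route(problem, sample, rewrite, vote_k=8, rewrite_threshold=0.5):
    """Route one instance through the three disagreement-guided stages."""
    # Stage 1: Disagreement Filter -- two initial samples.
    a, b = sample(problem), sample(problem)
    if a == b:
        return a  # consistent answers: accept immediately, no extra cost

    # Stage 2: Vote Resolve -- draw additional samples, majority-vote the pool.
    pool = [a, b] + [sample(problem) for _ in range(vote_k - 2)]
    best, count = Counter(pool).most_common(1)[0]
    if 1.0 - count / len(pool) < rewrite_threshold:
        return best  # moderate disagreement resolved by voting

    # Stage 3: Rewrite & Rethink -- reformulate the question, re-sample,
    # and vote over the combined pool of original and rewritten answers.
    pool += [sample(rewrite(problem)) for _ in range(vote_k)]
    return Counter(pool).most_common(1)[0][0]
```

Note how the cheap exit in Stage 1 is what saves computation on easy instances: two matching samples end the process before any voting or rewriting happens.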

Efficiency and Performance Gains

By intelligently routing problems, the framework avoids redundant computations on simple tasks and focuses resources where they are most needed. Experiments across seven mathematical benchmarks and three different models show that this approach improves accuracy by 3% to 7% compared to traditional methods. Notably, these gains are achieved while using fewer total samplings, making the process both more accurate and more efficient.

Broader Applicability

The researchers also tested their framework on code generation tasks, where they measured disagreement based on whether different code snippets produced the same functional output. The results suggest that this strategy-routing approach is not limited to math; it effectively improves performance in other reasoning-heavy domains, proving that a flexible, uncertainty-aware strategy is often more effective than a "one-size-fits-all" approach to test-time scaling.
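For code generation, disagreement can be estimated by executing candidate programs on shared test inputs and grouping them by output behavior, so that functionally equivalent snippets count as agreement even when their source text differs. This is a minimal sketch under that idea; the grouping scheme and error handling are assumptions, not the paper's exact protocol:

```python
from collections import Counter

def output_signature(candidate, test_inputs):
    """Run one candidate function on shared inputs; the tuple of
    outputs acts as its behavioral fingerprint."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(candidate(x))
        except Exception:
            outputs.append("<error>")
    return tuple(outputs)

def code_disagreement(candidates, test_inputs):
    """Fraction of candidates whose behavior differs from the largest
    functionally equivalent group."""
    signatures = [output_signature(c, test_inputs) for c in candidates]
    largest_group = Counter(signatures).most_common(1)[0][1]
    return 1.0 - largest_group / len(candidates)
```

For instance, two equivalent doubling functions (`x * 2` and `x + x`) and one tripling function evaluated on the inputs `[1, 2, 3]` yield a disagreement of 1/3: the two doublers share a signature and form the majority group.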
