Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling
Large Language Models (LLMs) are powerful reasoning tools, but they often need substantial computation to perform at their best. Current methods for scaling these models at inference time (while they generate answers) struggle to balance the "budget" (how much compute is used) against the "quality" of the reasoning. This paper introduces Dual-Dimensional Consistency (DDC), a framework that jointly optimizes the width of sampling (how many different paths are explored) and the depth of reasoning (how carefully each path is verified) to achieve high accuracy at significantly lower computational cost.
A Unified Approach to Scaling
Existing strategies typically treat the number of reasoning paths and the quality of those paths as separate, unrelated goals. Some methods focus on "width," using majority voting to reach a consensus, which can be inefficient and risky if the model consistently agrees on a wrong answer. Others focus on "depth," pruning individual paths that seem incorrect, but these often waste resources on simple queries or accidentally discard valid, complex reasoning. DDC bridges these two dimensions by using a shared set of quality signals to decide when to stop searching and which paths are worth keeping, ensuring that computational resources are focused only on the most promising reasoning chains.
How DDC Works
The DDC framework uses two primary mechanisms to manage resources dynamically:
Confidence-Weighted Bayesian Termination: Instead of just counting how many paths agree on an answer, DDC treats the consensus process as a Bayesian decision problem. It assigns a "confidence weight" to each path based on its quality. The system only stops generating new paths when it has reached an absolute majority backed by high-confidence evidence, preventing the model from settling for a "consensus" that might actually be a hallucination.
Trend-Aware Stratified Pruning: To improve the quality of individual reasoning paths, DDC monitors the generation process in real time. It tracks both the "position" (the current level) and the "velocity" (the rate of change) of confidence within a path. This allows the system to distinguish between a temporary dip in confidence, which might happen during a difficult but correct step, and a sustained, chaotic decline that indicates a hallucination. By identifying these trends, the model can prune low-quality paths early while preserving complex, valid ones.
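The two mechanisms above can be illustrated with a minimal Python sketch. This summary does not give the paper's actual formulas, so everything here is an assumption made for illustration: the stopping rule uses a Beta posterior over the leading answer's confidence-weighted vote share, and the pruning rule uses exponential smoothing to estimate the position and velocity of confidence. The function names, thresholds, and parameters (`prob_majority`, `should_stop`, `should_prune`, `threshold`, `alpha`, `floor`, `slope`) are hypothetical and are not taken from DDC.

```python
import math
from collections import defaultdict


def prob_majority(w_lead, w_rest, grid=2000):
    """P(share > 0.5) under a Beta(1 + w_lead, 1 + w_rest) posterior.

    Assumption: confidence weights act as fractional votes for or against
    the leading answer. The tail is integrated with the midpoint rule.
    """
    a, b = 1.0 + w_lead, 1.0 + w_rest
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = 0.5 / grid
    total = 0.0
    for i in range(grid):
        p = 0.5 + (i + 0.5) * h
        total += math.exp(
            log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)
        )
    return total * h


def should_stop(paths, threshold=0.95):
    """paths: list of (answer, confidence in [0, 1]).

    Returns (leading_answer, stop_flag). Sampling stops only when the
    posterior probability of a weighted absolute majority is high enough.
    """
    if not paths:
        return None, False
    weights = defaultdict(float)
    for answer, conf in paths:
        weights[answer] += conf
    leader = max(weights, key=weights.get)
    w_lead = weights[leader]
    w_rest = sum(weights.values()) - w_lead
    return leader, prob_majority(w_lead, w_rest) >= threshold


def should_prune(confidences, alpha=0.3, floor=0.5, slope=-0.02):
    """Prune when smoothed confidence is low AND still falling.

    level    ~ "position": exponentially smoothed confidence
    velocity ~ smoothed step-to-step change of that level
    """
    level = None
    velocity = 0.0
    for c in confidences:
        if level is None:
            level = c
            continue
        new_level = alpha * c + (1 - alpha) * level
        velocity = alpha * (new_level - level) + (1 - alpha) * velocity
        level = new_level
        if level < floor and velocity < slope:
            return True
    return False


if __name__ == "__main__":
    # Five agreeing paths: high confidence stops, low confidence does not.
    print(should_stop([("42", 0.9)] * 5))
    print(should_stop([("42", 0.2)] * 5))
    # A one-step dip survives; a sustained decline is pruned.
    print(should_prune([0.9, 0.9, 0.4, 0.9, 0.9]))
    print(should_prune([0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]))
```

In this sketch, five paths that agree with low confidence do not trigger termination, while the same agreement with high confidence does, which mirrors the idea that a consensus must be backed by high-confidence evidence. Likewise, requiring both a low level and a negative velocity is one simple way to let a brief dip recover before the path is discarded.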
Efficiency and Performance
The researchers tested DDC across five challenging reasoning benchmarks using various Qwen models. The results indicate that DDC significantly outperforms traditional static scaling methods. On average, the framework reduces token consumption by more than a factor of 10 compared to strong baselines while matching or exceeding their accuracy. For example, on the AIME25 benchmark using the Qwen3-4B model, DDC achieved a 15.6% gain in accuracy while reducing token usage by a factor of approximately 27.
Key Takeaways
The DDC framework shows that inference-time scaling does not have to be a choice between high cost and high accuracy. By integrating path-level quality metrics directly into the decision-making process, the system becomes "aware" of the quality of its own reasoning. This allows it to allocate compute where it is needed most (on complex, high-stakes reasoning) while aggressively filtering out noise and redundant computation. The approach offers a more efficient and reliable way to scale LLM reasoning without additional training.