SPIRAL: Learning to Search and Aggregate

Key Takeaways

  • Language models often struggle to solve complex reasoning tasks, not because they lack the necessary compute, but because they fail to use that compute effectively.
  • During post-training, language models are optimized only for sequential reasoning within a single trace.
  • We introduce Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL), a framework in which a language model is trained to use all three primitives, as part of a unified inference compute pipeline.
Paper Abstract

Language model reasoning can be substantially improved at test time via scaffolds that scale inference compute across different primitives -- sequential reasoning within a trace, independently sampled parallel traces, and aggregation of multiple reasoning traces into a final response. During post-training, however, language models are optimized only for sequential reasoning within a single trace. We introduce Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL), a framework in which a language model is trained to use all three primitives, as part of a unified inference compute pipeline. Concretely, the language model first samples a set of independent traces in parallel, each produced through sequential chain-of-thought reasoning, and then generates a final aggregation trace conditioned on those traces; all components are optimized end-to-end against the reward of the final aggregated response. To train this system, SPIRAL uses set reinforcement learning to teach models to produce a set of traces that are collectively useful for an aggregator and standard reinforcement learning to teach models to aggregate the set into improved final responses. Our experiments on reasoning tasks show that SPIRAL effectively scales with inference compute, outperforming GRPO by up to 11× scaling efficiency and 15% higher performance when all three compute primitives are scaled.

Language models often struggle to solve complex reasoning tasks, not because they lack the necessary compute, but because they fail to use that compute effectively. Models are typically trained to produce a single, sequential chain of thought, yet at test time they are often run with more complex strategies, such as generating multiple parallel ideas and aggregating them into a final answer. SPIRAL (Sequential-Parallel-Aggregative Reinforcement Learning) is a framework designed to bridge this gap by training models to coordinate three compute primitives (sequential reasoning, parallel exploration, and aggregative synthesis) as a unified pipeline.

A Unified Approach to Reasoning

Current training methods focus on optimizing a single, linear path to a solution. SPIRAL changes this by exposing the model to the entire inference process during training. The model first generates a set of independent "search traces" in parallel. It then uses these traces as context to produce a final "aggregation trace." By optimizing the entire pipeline end-to-end against the reward of the final output, the model learns to generate search traces that are specifically useful for the aggregation phase, rather than just trying to solve the problem in isolation.
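
The two-stage rollout described above can be sketched in a few lines of Python. This is only an illustrative sketch: `spiral_rollout`, the `Generate` callable, the `toy_generate` stub, and the prompt layout are all hypothetical names and choices, not taken from the paper. A real system would batch the search traces rather than loop over them.

```python
from typing import Callable, Dict, List, Union

# Hypothetical interface: any function mapping a prompt string to a completion.
Generate = Callable[[str], str]


def spiral_rollout(
    question: str, generate: Generate, num_traces: int = 4
) -> Dict[str, Union[str, List[str]]]:
    """Sample independent search traces, then one aggregation trace."""
    # Parallel phase: every trace sees only the question, so the traces
    # are independent of one another (a loop stands in for batching).
    search_traces = [generate(question) for _ in range(num_traces)]

    # Aggregation phase: the same model reads all traces as context.
    # This prompt layout is an assumption made for illustration.
    context = "\n\n".join(
        f"[Trace {i + 1}]\n{trace}" for i, trace in enumerate(search_traces)
    )
    aggregation_prompt = f"{question}\n\n{context}\n\n[Final answer]"
    final = generate(aggregation_prompt)

    return {
        "search_traces": search_traces,
        "aggregation_prompt": aggregation_prompt,
        "final": final,
    }


def toy_generate(prompt: str) -> str:
    """Stand-in for a language model call, used only to run the sketch."""
    if "[Final answer]" in prompt:
        return "final: 4"
    return "2 + 2 = 4"


if __name__ == "__main__":
    result = spiral_rollout("What is 2 + 2?", toy_generate, num_traces=3)
    print(result["final"])
```

During training, only the reward of `final` would be observed, and that single signal has to be propagated back to both the aggregation trace and the search traces that fed it.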

How SPIRAL Learns

To train this system, the framework employs two distinct reinforcement learning techniques:

  • Standard Reinforcement Learning: Used for the aggregation phase, this teaches the model how to synthesize a set of candidate traces into a high-quality final response.

  • Set Reinforcement Learning: Used for the parallel search phase, this assigns credit to the entire set of traces. Because the model is rewarded based on the final aggregated output, it learns to produce a diverse set of traces that, when combined, lead to the correct answer. This prevents the model from simply repeating the same, potentially flawed, reasoning path.
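
One way to picture the split between the two signals is the sketch below. The summary does not give the exact estimator, so everything here is an assumption: the function name `spiral_advantages`, the choice of mean baselines, and the setup in which each sampled trace set is aggregated several times. It shows set-level credit (one advantage shared by every trace in a set) next to per-response credit for the aggregator.

```python
from statistics import mean
from typing import List, Tuple


def spiral_advantages(
    rewards: List[List[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Compute set-level and aggregation-level advantages.

    rewards[g][m] is the reward of the m-th aggregated response that was
    conditioned on the g-th set of search traces (hypothetical layout).
    """
    # Score each trace set by how well aggregations built on it perform.
    set_scores = [mean(row) for row in rewards]

    # Set-level credit: one advantage per set, shared by all traces in it.
    # Assumption: a mean baseline across the sampled sets.
    baseline = mean(set_scores)
    set_advantages = [score - baseline for score in set_scores]

    # Aggregation-level credit: each aggregated response is compared with
    # the other aggregations of the same trace set.
    agg_advantages = [
        [r - score for r in row] for row, score in zip(rewards, set_scores)
    ]
    return set_advantages, agg_advantages


if __name__ == "__main__":
    # Three trace sets, two aggregations each; reward 1.0 means correct.
    rewards = [[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
    set_adv, agg_adv = spiral_advantages(rewards)
    print(set_adv)  # [0.5, 0.0, -0.5]
    print(agg_adv)  # [[0.0, 0.0], [0.5, -0.5], [0.0, 0.0]]
```

In this toy example, the first set is reinforced as a whole because aggregations built on it succeed, regardless of whether any individual trace in it was correct on its own.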

Scaling Performance

The researchers evaluated SPIRAL by fine-tuning a language model on mathematical reasoning tasks. The results indicate that SPIRAL converts inference compute into performance more efficiently than standard methods such as GRPO. When scaling the number of parallel traces, the framework achieved up to 11 times higher scaling efficiency than GRPO. When all three compute primitives (sequential, parallel, and aggregative) were scaled together, it reached up to 15% higher performance on test sets.

Why This Matters

By internalizing these search and aggregation strategies during training, SPIRAL reduces the need for the complex, hand-designed scaffolds that practitioners currently use to orchestrate reasoning. Instead of relying on rigid heuristics to manage how a model thinks, the model learns to allocate its "thinking budget" itself: deciding when to explore broadly, when to refine ideas, and how to synthesize disparate information to reach a correct conclusion.
