SPIRAL: Learning to Search and Aggregate
Language models often struggle to solve complex reasoning tasks, not because they lack the necessary compute, but because they fail to use that compute effectively. While models are typically trained to produce a single, sequential chain of thought, they are often used at test time with more complex strategies, such as generating multiple parallel ideas and aggregating them into a final answer. SPIRAL (Sequential-Parallel-Aggregative Reinforcement Learning) is a new framework designed to bridge this gap by training models to coordinate these three compute primitives—sequential reasoning, parallel exploration, and aggregative synthesis—as a unified pipeline.
A Unified Approach to Reasoning
Current training methods focus on optimizing a single, linear path to a solution. SPIRAL changes this by exposing the model to the entire inference process during training. The model first generates a set of independent "search traces" in parallel. It then uses these traces as context to produce a final "aggregation trace." By optimizing the entire pipeline end-to-end against the reward of the final output, the model learns to generate search traces that are specifically useful for the aggregation phase, rather than just trying to solve the problem in isolation.
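The two-stage rollout described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `generate` callable, the prompt wording, and the `spiral_rollout` name are all hypothetical stand-ins for whatever sampling interface a real system would use.

```python
from typing import Callable, Dict, List


def spiral_rollout(
    problem: str,
    generate: Callable[[str], str],  # hypothetical: maps a prompt to one sampled trace
    num_search: int = 4,
) -> Dict[str, object]:
    """Run one search-then-aggregate rollout for a single problem."""
    # Phase 1: independent search traces. Each call sees only the problem,
    # so the traces do not condition on one another.
    search_prompt = f"Problem: {problem}\nThink step by step."
    search_traces: List[str] = [generate(search_prompt) for _ in range(num_search)]

    # Phase 2: a single aggregation trace conditioned on every search trace.
    context = "\n\n".join(
        f"Trace {i + 1}:\n{trace}" for i, trace in enumerate(search_traces)
    )
    aggregation_prompt = (
        f"Problem: {problem}\n\n{context}\n\nSynthesize a final answer."
    )
    aggregation_trace = generate(aggregation_prompt)

    return {
        "search_traces": search_traces,
        "aggregation_prompt": aggregation_prompt,
        "aggregation_trace": aggregation_trace,
    }
```

In this sketch the search calls run in a loop for simplicity; because they share no state, a real system could issue them as one batched sampling request.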
How SPIRAL Learns
To train this system, the framework employs two distinct reinforcement learning techniques:
Standard Reinforcement Learning: Used for the aggregation phase, this teaches the model how to synthesize a set of candidate traces into a high-quality final response.
Set Reinforcement Learning: Used for the parallel search phase, this assigns credit to the entire set of traces. Because the model is rewarded based on the final aggregated output, it learns to produce a diverse set of traces that, when combined, lead to the correct answer. This prevents the model from simply repeating the same, potentially flawed, reasoning path.
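One way to picture the two credit-assignment signals is sketched below. The article does not give the exact formulas, so everything here is an assumption made for illustration: several search sets are sampled per problem, several aggregation traces are sampled per set, a set's reward is the mean reward of its aggregations, and both advantages use a simple group-mean baseline. The function name `spiral_advantages` is hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def spiral_advantages(
    agg_rewards: List[List[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Compute illustrative advantages for one problem.

    agg_rewards[s][k] is the reward of the k-th aggregation trace that was
    conditioned on search set s.
    """
    # Assumed set-level reward: average reward of the aggregations built on the set.
    set_rewards = [mean(rewards) for rewards in agg_rewards]

    # Set-level advantage: how much better this set did than the average set.
    # Every search trace in set s would share the single value set_adv[s].
    baseline = mean(set_rewards)
    set_adv = [reward - baseline for reward in set_rewards]

    # Aggregation-level advantage: each aggregation trace is compared only
    # with the other aggregations that saw the same search set.
    agg_adv = [
        [reward - mean(rewards) for reward in rewards] for rewards in agg_rewards
    ]
    return set_adv, agg_adv


# Two search sets, two aggregation traces each; only one aggregation succeeds.
set_adv, agg_adv = spiral_advantages([[1.0, 0.0], [0.0, 0.0]])
print(set_adv)  # [0.25, -0.25]
print(agg_adv)  # [[0.5, -0.5], [0.0, 0.0]]
```

Under these assumptions, a search set earns credit only through what the aggregator manages to do with it, which is why a set of near-identical traces gains nothing over a single trace.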
Scaling Performance
The researchers evaluated SPIRAL by fine-tuning a language model on mathematical reasoning tasks. The results show that SPIRAL converts inference compute into performance significantly more efficiently than standard methods like GRPO. When the number of parallel traces was scaled, the framework achieved up to 11 times higher scaling efficiency. When parallel and aggregative compute were scaled together, the model saw a 15% improvement in performance on test sets.
Why This Matters
By internalizing these search and aggregation strategies during training, SPIRAL reduces the need for complex, hand-designed scaffolds that practitioners currently use to orchestrate reasoning. Instead of relying on rigid heuristics to manage how a model thinks, the model learns to proactively allocate its "thinking budget"—deciding when to explore broadly, when to refine ideas, and how to synthesize disparate information to reach a correct conclusion.