"How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum" explores why reasoning models often struggle to learn from output-level feedback alone. When a model is in the early stages of training (a "cold start"), it rarely produces correct answers, which makes it hard for reinforcement learning (RL) to provide useful guidance. This paper introduces a new family of loss functions that helps models escape this stagnation by adjusting how strongly the model "commits" to the provided supervision.
The Problem of Cold-Start Stagnation
Reasoning models, such as those that generate chains of thought or proof sketches, are typically trained with reinforcement learning from verifiable rewards (RLVR). However, if the model's initial success probability is very low, it almost never produces a correct answer and therefore receives almost no learning signal. While standard RL focuses on exploitation (maximizing current success), it can be prohibitively slow at the start of training. Conversely, density estimation (maximizing the likelihood of known-correct reasoning paths) helps the model learn faster but tends to memorize noise and errors in the training data.
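A standard way to see the cold-start problem (this is the generic REINFORCE identity, not notation taken from the paper): with a binary verifiable reward $R(z) \in \{0, 1\}$ over sampled rationales $z$, the gradient of the success probability $p_\theta$ is

$$
\nabla_\theta\, p_\theta = \mathbb{E}_{z \sim \pi_\theta}\!\left[ R(z)\, \nabla_\theta \log \pi_\theta(z) \right].
$$

Only successful rollouts contribute nonzero terms, so if $p_\theta \approx 10^{-3}$, roughly one sample in a thousand carries any signal at all, and the estimated gradient is both tiny and noisy.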
The Tsallis Loss Continuum
The authors propose a unified framework based on the Tsallis $q$-logarithm that bridges pure exploitation and density estimation. By adjusting a single "commitment" parameter $q$, one can move along a continuum (made concrete in the formula after this list):
Low $q$ (Exploitation): The model focuses on what it already knows, which is robust against noise but struggles to escape the cold-start phase.
High $q$ (Density Estimation): The model pushes harder toward the provided supervision, allowing it to escape cold starts much faster, though it risks memorizing incorrect labels.
Intermediate $q$: This provides a balance, allowing the model to escape cold starts while maintaining resistance to noise.
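For reference, the usual definition of the Tsallis $q$-logarithm (the paper may normalize or parameterize it differently, so treat this as an assumption) is

$$
\ln_q(x) = \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1), \qquad \ln_1(x) = \ln x,
$$

with the loss on the model's success probability $p_\theta$ taken as $-\ln_q(p_\theta)$. Under this parameterization, $q = 0$ gives $1 - p_\theta$, the negated expected reward of standard RL (exploitation), while $q \to 1$ recovers the negative log-likelihood of density estimation; intermediate values such as $q = 0.75$ interpolate between the two regimes.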
Two Ways to Estimate Gradients
Because the ideal mathematical objective is difficult to compute directly, the authors developed two practical methods to estimate its gradient (a schematic comparison follows this list):
Gradient-Amplified RL (GARL): This method samples from the model's prior and amplifies the RL gradient. It is highly effective at helping models escape cold starts, though it can become unstable during later training stages.
Posterior-Attenuated Fine-Tuning (PAFT): This method samples from the posterior—focusing on rationales that lead to the correct answer—and uses standard fine-tuning. It provides more stable, semantically coherent gradients, making it better suited for later stages of training where stability is key.
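The split between the two estimators can be motivated by the chain rule: $\nabla_\theta\left(-\ln_q p_\theta\right) = -p_\theta^{-q}\, \nabla_\theta p_\theta = -p_\theta^{1-q}\, \nabla_\theta \log p_\theta$. The first form amplifies a REINFORCE gradient over prior samples (the factor $p_\theta^{-q}$ is large when success is rare); the second attenuates a supervised fine-tuning gradient over posterior samples (the weight $p_\theta^{1-q}$ stays in $(0, 1]$ for $q < 1$). Below is a minimal PyTorch sketch of surrogate losses built on this identity; the function names and the detached success-probability estimate `p_hat` are illustrative assumptions, not the paper's implementation.

```python
import torch

def garl_surrogate(log_probs: torch.Tensor, rewards: torch.Tensor,
                   p_hat: torch.Tensor, q: float) -> torch.Tensor:
    """GARL-style surrogate (schematic): REINFORCE over rationales sampled
    from the model's prior, with the policy gradient amplified by
    p_hat**(-q). The amplification grows without bound as p_hat -> 0,
    which is what drives fast escape from cold starts."""
    amplification = p_hat.detach().pow(-q)
    return -(amplification * rewards * log_probs).mean()

def paft_surrogate(posterior_log_probs: torch.Tensor,
                   p_hat: torch.Tensor, q: float) -> torch.Tensor:
    """PAFT-style surrogate (schematic): cross-entropy on rationales that
    reached the correct answer (posterior samples), attenuated by
    p_hat**(1-q). The weight is bounded in (0, 1] for q < 1; at q = 1
    this reduces to plain supervised fine-tuning."""
    attenuation = p_hat.detach().pow(1.0 - q)
    return -(attenuation * posterior_log_probs).mean()
```

In a training loop one might start with `garl_surrogate` and switch to `paft_surrogate` once the success estimate clears a threshold, mirroring the early-GARL, late-PAFT recipe reported in the results below.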
Empirical Results
The researchers tested these methods on reasoning benchmarks including FinQA, HotPotQA, and MuSiQue. Using GARL at an intermediate commitment level ($q = 0.75$), models escaped cold starts in scenarios where standard methods failed entirely. In later training stages, switching to PAFT provided significant stability and performance gains, including a notable improvement on HotPotQA over standard reinforcement learning baselines. The findings suggest that the optimal training strategy balances the speed of learning against the stability of the gradients over the course of training.