How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Key Takeaways

  • The paper explores why reasoning models often struggle to learn from output-level feedback alone: adapting to new tasks under reinforcement learning from verifiable rewards (RLVR) stalls when the initial success probability $p_0$ is small.
  • A loss family built on the Tsallis $q$-logarithm interpolates between RLVR ($q{=}0$) and the log-marginal-likelihood over latent trajectories ($q{=}1$); all members share the same per-example gradient direction, differing only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate.
  • Two Monte Carlo estimators realize this loss: both have bias $O\big(\frac{q}{M P_{\theta}^{q+1}}\big)$; GARL has lower variance, while PAFT has semantically coherent gradients.
Paper Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $\Omega(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $\Theta\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_\theta$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_{\theta}^{q+1}}\big)$; GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum explores why reasoning models often struggle to learn from output-level feedback alone. When a model is in the early stages of training—a "cold start"—it often fails to produce correct answers, making it difficult for reinforcement learning (RL) to provide useful guidance. This paper introduces a new family of loss functions that helps models escape this stagnation by adjusting how much "commitment" the model shows toward the provided supervision.

The Problem of Cold-Start Stagnation

Reasoning models, such as those that generate chains of thought or proof sketches, are typically trained using reinforcement learning from verifiable rewards (RLVR). However, if the model’s initial success probability is very low, it rarely produces the correct answer, meaning it receives little to no useful signal to improve. While standard RL focuses on exploitation (maximizing current success), this approach can be prohibitively slow at the start of training. Conversely, density estimation (learning the likelihood of correct paths) can help the model learn faster but often leads to memorizing noise and errors in the training data.
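The speed gap between the two regimes can be illustrated with a toy gradient-flow model. This is an illustrative assumption, not the paper's exact dynamics: suppose the per-example gradient on the success probability $p$ is amplified by $p^{-q}$, giving $\dot p = p^{2-q}$, so $q{=}0$ is the exploitation (RLVR) pole and $q{=}1$ the density-estimation pole. Integrating this ODE reproduces the stated escape times, $\Omega(1/p_0)$ versus $\Theta(\log(1/p_0))$:

```python
import math

def escape_time(p0, q, target=0.5, dt=1e-2, max_t=1e7):
    """Toy gradient flow (an illustrative assumption, not the paper's
    exact dynamics): dp/dt = p^{-q} * p^2 = p^{2-q}.
    q=0 is the exploitation (RLVR) pole, q=1 the density-estimation pole."""
    p, t = p0, 0.0
    while p < target and t < max_t:
        p += dt * p ** (2.0 - q)  # forward-Euler step
        t += dt
    return t

p0 = 1e-3
t_rl = escape_time(p0, q=0.0)  # exploitation: escape time ~ 1/p0 ≈ 1000
t_de = escape_time(p0, q=1.0)  # density estimation: ~ log(1/p0) ≈ 7
print(t_rl, t_de)
```

With $p_0 = 10^{-3}$, the exploitation pole needs on the order of a thousand time units to reach a 50% success rate, while the density-estimation pole needs fewer than ten, which is why output-only RL stalls at cold start.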

The Tsallis Loss Continuum

The authors propose a unified framework using the Tsallis $q$-logarithm to bridge the gap between pure exploitation and density estimation. By adjusting a parameter $q$ (called "commitment"), researchers can move along a continuum:

  • Low $q$ (Exploitation): The model focuses on what it already knows, which is robust against noise but struggles to escape the cold-start phase.

  • High $q$ (Density Estimation): The model pushes harder toward the provided supervision, allowing it to escape cold starts much faster, though it risks memorizing incorrect labels.

  • Intermediate $q$: This provides a balance, allowing the model to escape cold starts while maintaining resistance to noise.
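The continuum above can be made concrete with the Tsallis $q$-logarithm itself, $\ln_q(x) = \frac{x^{1-q}-1}{1-q}$ (with the natural log recovered as $q \to 1$). Assuming, consistent with the abstract, that the loss is built from $\ln_q$ applied to the per-example success probability $P_\theta$, the two poles and the $P_\theta^{-q}$ amplification fall out directly:

```python
import math

def tsallis_log(x, q):
    """Tsallis q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q),
    with the natural log recovered in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

p = 0.2  # an example per-example success probability
# q=0 reduces to expected reward (RLVR pole): ln_0(p) = p - 1
print(tsallis_log(p, 0.0))   # -0.8
# q=1 is the log-likelihood (density-estimation pole): ln_1(p) = log(p)
print(tsallis_log(p, 1.0))   # ≈ -1.609
# The derivative d/dp ln_q(p) = p^{-q} is exactly the scalar
# amplification that reweights each example along the continuum.
```

Since $\frac{d}{dp}\ln_q(p) = p^{-q}$, every member of the family shares the gradient direction $\nabla_\theta P_\theta$ and differs only in the per-example scalar $P_\theta^{-q}$, as the abstract states.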

Two Ways to Estimate Gradients

Because the ideal mathematical objective is difficult to calculate directly, the authors developed two practical methods to estimate the necessary gradients:

  • Gradient-Amplified RL (GARL): This method samples from the model's prior and amplifies the RL gradient. It is highly effective at helping models escape cold starts, though it can become unstable during later training stages.

  • Posterior-Attenuated Fine-Tuning (PAFT): This method samples from the posterior—focusing on rationales that lead to the correct answer—and uses standard fine-tuning. It provides more stable, semantically coherent gradients, making it better suited for later stages of training where stability is key.
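The two estimators can be sketched on a toy categorical "trajectory" model. This is a hypothetical setup (a softmax over three candidate trajectories with binary rewards; function names and details beyond the abstract's description are illustrative assumptions). Both rest on the identity $\nabla J_q = P_\theta^{-q}\nabla P_\theta$: GARL writes $\nabla P_\theta$ as a prior expectation of the reward-weighted score, while PAFT rewrites it as $P_\theta$ times a posterior expectation of the plain score (a standard SFT gradient):

```python
import math, random

random.seed(0)

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]

def estimate_grad(theta, rewards, q, M=200_000, method="garl"):
    """Monte Carlo estimators of grad J_q = P^{-q} grad P for a toy
    categorical policy (hypothetical setup, not the paper's LLM policies)."""
    p = softmax(theta)
    K = len(theta)
    samples = random.choices(range(K), weights=p, k=M)
    P_hat = sum(rewards[z] for z in samples) / M  # plug-in estimate of P
    grad = [0.0] * K
    if method == "garl":
        # Prior samples, reward-weighted score (RL gradient),
        # amplified by P_hat^{-q}.
        for z in samples:
            for k in range(K):
                grad[k] += rewards[z] * ((1.0 if z == k else 0.0) - p[k])
        return [P_hat ** (-q) * g / M for g in grad]
    # "paft": resample to the posterior over correct trajectories,
    # then take the plain score (SFT gradient), attenuated by P_hat^{1-q}.
    kept = [z for z in samples if rewards[z] > 0]
    for z in kept:
        for k in range(K):
            grad[k] += (1.0 if z == k else 0.0) - p[k]
    return [P_hat ** (1.0 - q) * g / len(kept) for g in grad]

theta = [0.0, 1.0, 2.0]    # logits of 3 candidate trajectories
rewards = [1.0, 0.0, 0.0]  # only trajectory 0 is verified correct
p = softmax(theta)
P = p[0]
exact = [P ** (-0.75) * p[0] * ((1.0 if k == 0 else 0.0) - p[k])
         for k in range(3)]
g_garl = estimate_grad(theta, rewards, 0.75, method="garl")
g_paft = estimate_grad(theta, rewards, 0.75, method="paft")
```

Both estimates converge to the same exact gradient; they differ in how the noise behaves, which mirrors the paper's finding that GARL has lower variance while PAFT's posterior samples yield semantically coherent (SFT-style) gradients.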

Empirical Results

The researchers tested these methods on reasoning benchmarks including FinQA, HotPotQA, and MuSiQue. GARL at an intermediate commitment level ($q=0.75$) allowed models to escape cold starts in scenarios where GRPO failed entirely. In warm-start training, GARL at low $q$ dominated on FinQA, where training remained stable; on HotPotQA and MuSiQue, where GARL destabilized, PAFT at $q=0.75$ provided stable gradients and the best overall HotPotQA result (47.9 maj@16, $+14.4$ over GRPO). The findings suggest that the optimal training strategy balances the speed of learning against the stability of the gradients throughout the model's development.
