AI News

NVIDIA Integrates Speculative Decoding into NeMo RL for Faster Training

May 2, 2026 • Developer Tools • AI Research • AI Models

NVIDIA's NeMo RL v0.6.0 release integrates speculative decoding to accelerate reinforcement learning rollouts by up to 1.8x without sacrificing model accuracy.

Key Takeaways

  • Targets the primary bottleneck in reinforcement learning (RL) training, accelerating rollout generation by up to 1.8x at the 8B scale.
  • Provides a lossless optimization method that maintains training fidelity, unlike traditional techniques that trade accuracy for speed.
  • Enables significant compute efficiency for large-scale models, with projected 2.5x end-to-end speedups for 235B parameter models.

NVIDIA researchers have integrated speculative decoding directly into the NeMo RL training loop, addressing the latency bottleneck inherent in reinforcement learning post-training. Using a two-path architecture built around the EAGLE-3 drafting framework, the team achieved lossless rollout acceleration: the target model's output distribution remains identical to that of standard autoregressive generation. The feature ships in NeMo RL v0.6.0 and accelerates training for complex tasks such as math reasoning and code generation without sacrificing training fidelity.
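For readers unfamiliar with why the scheme is lossless: standard speculative sampling (Leviathan et al., 2023), whose verification rule EAGLE-style drafters also rely on, accepts each drafted token x by rejection sampling, where p is the target model's next-token distribution and q is the draft model's. This rule provably yields samples distributed exactly according to p; it is background material, not a formula from the NeMo RL announcement:

```latex
P(\text{accept } x) \;=\; \min\!\left(1, \frac{p(x)}{q(x)}\right),
\qquad
x_{\text{on rejection}} \;\sim\; \frac{\max\bigl(0,\, p(\cdot) - q(\cdot)\bigr)}{\sum_{x'} \max\bigl(0,\, p(x') - q(x')\bigr)}
```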

Addressing the Rollout Bottleneck

In synchronous reinforcement learning workflows, rollout generation is the primary performance constraint, typically consuming 65% to 72% of the total step time. Because log-probability recomputation and policy optimization account for a much smaller fraction of the process, accelerating generation is essential for improving overall training throughput. Unlike other methods such as off-policy replay or low-precision rollouts, which may compromise training quality, speculative decoding uses a smaller draft model to propose tokens that a larger target model verifies, ensuring the final output remains mathematically consistent with the target policy.
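The draft-and-verify loop can be sketched in a few lines. The following is a minimal toy illustration, not NeMo RL or EAGLE-3 code: fixed categorical distributions stand in for the draft and target models, and in a real system the target scores all drafted tokens in a single batched forward pass, which is where the latency win comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary

# Fixed next-token distributions standing in for the draft (q) and target (p)
# models; real models would condition these on the context.
q = rng.dirichlet(np.ones(VOCAB))  # draft model's distribution
p = rng.dirichlet(np.ones(VOCAB))  # target model's distribution

def speculative_step(k: int = 3) -> list[int]:
    """One draft-and-verify round: the draft proposes k tokens, the target
    verifies them. The accept/resample rule guarantees every emitted token
    is distributed exactly according to the target p (lossless)."""
    drafts = rng.choice(VOCAB, size=k, p=q)  # draft proposes k tokens
    out = []
    for t in drafts:
        # Accept a drafted token with probability min(1, p(t)/q(t)).
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(int(t))
        else:
            # On rejection, resample from the normalized residual max(0, p - q)
            # and stop speculating for this round.
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out
    # All k drafts accepted: the target's verification pass yields one bonus token.
    out.append(int(rng.choice(VOCAB, p=p)))
    return out

print(speculative_step())
```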

Performance Gains at Scale

Testing on 32 GB200 GPUs at the 8B model scale demonstrated significant efficiency improvements. On the RL-Zero workload, the integration of EAGLE-3 reduced generation latency from 100 seconds to 56.6 seconds, representing a 1.8x speedup. These generation-side gains translated to an overall step speedup of 1.41x. While performance gains vary based on the complexity of the reasoning task, the research confirms that speculative decoding and asynchronous execution are complementary, as the former reduces the cost of individual rollouts while the latter hides remaining generation time behind other training computations.
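The step-level number is consistent with the rollout share quoted earlier. As a back-of-the-envelope check via Amdahl's law (f and s are our notation; f is taken from the low end of the 65% to 72% range, s is the generation speedup):

```latex
\text{step speedup} \;=\; \frac{1}{(1 - f) + f/s}
\;=\; \frac{1}{0.35 + 0.65/1.8}
\;\approx\; 1.41
```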

Optimizing Configuration for Training

The research highlights that achieving these speedups requires careful configuration, particularly regarding draft initialization and length. Researchers found that draft models initialized on in-domain datasets, such as DAPO, significantly outperform those initialized on general-purpose chat datasets. Furthermore, while increasing draft length can improve acceptance rates, it often leads to diminishing returns or performance degradation in complex reasoning tasks. For most workloads, a draft length of k=3 serves as a reliable default, as longer settings can introduce overhead that offsets the benefits of speculation.
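The diminishing returns from longer drafts fall out of a simple throughput model. The sketch below uses illustrative numbers, not measurements from NVIDIA's study: `alpha` is an assumed per-token acceptance rate and `c` an assumed cost of one draft pass in units of a target pass, ignoring that verifying k tokens costs slightly more than generating one.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-and-verify round with per-token
    acceptance rate alpha and draft length k: a geometric run of accepted
    drafts plus one resampled or bonus token (Leviathan et al., 2023)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def throughput_gain(alpha: float, k: int, c: float = 0.05) -> float:
    """Tokens per round divided by round cost (k draft passes plus one
    target verification pass), relative to one target pass per token."""
    return expected_tokens(alpha, k) / (k * c + 1)

for k in range(1, 9):
    print(f"k={k}: ~{throughput_gain(alpha=0.7, k=k):.2f}x")
```

Under these toy assumptions the gain flattens past k ≈ 4, and a lower acceptance rate, as seen on harder reasoning traces, pulls the optimum toward shorter drafts, which matches k=3 as a conservative default.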

Future Projections for Large Models

Using a proprietary GPU performance simulator, the research team modeled the impact of this technique on larger models, specifically Qwen3-235B-A22B. At that scale, running on 2048 GB200 GPUs with asynchronous RL, the system is projected to achieve a 3.5x rollout speedup and a 2.5x end-to-end training speedup. These findings suggest that as models grow in size, the integration of speculative decoding will become increasingly vital for maintaining efficient training cycles in high-compute environments.