NVIDIA Releases Polar: Streamlining RL for Coding Agents

Key Takeaways

  • Eliminates the need to rewrite complex agent code, allowing developers to apply reinforcement learning to existing tools like Claude Code and Codex without modification.
  • Introduces 'prefix-merging' trajectory reconstruction, which delivers a 5.39x speedup in training efficiency compared to traditional per-request methods.
  • Provides a versatile, open-source solution for both online RL training and high-quality offline SFT data generation.

NVIDIA has introduced Polar, a new rollout framework designed to streamline reinforcement learning for language agents by enabling training across existing coding harnesses without requiring modifications to their underlying code. By acting as a proxy at the model API boundary, Polar allows researchers to connect complex agent software—such as Codex CLI, Claude Code, Qwen Code, and Pi—directly to training pipelines. This approach addresses the traditional engineering challenge of integrating agent harnesses, which often require extensive rewrites to function within standard reinforcement learning environments.

Simplifying Agent Integration

The core innovation of Polar is its ability to intercept model calls at the API level. Instead of forcing developers to rewrite harness logic to fit specific framework APIs, Polar places a gateway proxy between the agent and the model. This proxy detects the provider API, normalizes requests into a unified format, and captures essential token-level data, including prompt IDs, sampled response tokens, and log probabilities. By simply pointing a harness’s model base URL to the Polar gateway, researchers can capture execution details that are otherwise lost in conventional integration methods.
The framework architecture consists of a rollout server and gateway nodes. The rollout server manages task requests and distributes sessions, while gateway nodes oversee the lifecycle of each session, from runtime initialization to trajectory building and evaluation. This design ensures that CPU-heavy tasks, such as runtime preparation, occur off the critical path, preventing bottlenecks during GPU-bound agent execution.

Optimizing Trajectory Reconstruction

Polar offers two primary strategies for reconstructing trainable trajectories: per-request and prefix-merging. While the per-request builder treats every model call as an independent trace, it can fragment multi-turn sessions. In contrast, the prefix-merging builder reconstructs longer, more coherent traces by partitioning completions into ordered chains based on token-prefix relations. Experiments demonstrate that prefix-merging significantly enhances efficiency, reducing the number of trainer updates from 1,185 to 218 and providing a 5.39x speedup in wall-clock time compared to the per-request method.

Performance and Application

When applied to the Qwen3.5-4B model using Group Relative Policy Optimization (GRPO), Polar demonstrated significant performance gains on the SWE-Bench Verified benchmark. The framework achieved a 22.6-point improvement on the Codex harness, proving its effectiveness in aligning models with unfamiliar action protocols. Even on well-aligned harnesses like Qwen Code, Polar delivered measurable gains.
Beyond online reinforcement learning, Polar functions as a distributed offline data generation service. In testing, the framework generated 504 accepted SFT trajectories from 1,638 attempts, maintaining a 30.8% acceptance rate. By enabling researchers to capture high-quality, token-faithful data directly from native coding environments, Polar provides a versatile tool for both training and data collection, now released as open source under NeMo Gym.

Comments (0)

No comments yet

Be the first to share your thoughts!