ATOD: Annealed Turn-aware On-policy Distillation fo...

Key Takeaways

Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement.
On-policy distillation (OPD) provides dense teacher guidance and typically improves rapidly in the early stage, but its gains saturate once the student approaches the teacher, limiting the final performance ceiling.
In this paper, we propose ATOD (Annealed Turn-aware On-policy Distillation), a hybrid online distillation algorithm that explicitly exploits this complementarity.
ATOD uses an annealed OPD-RL schedule: OPD dominates early training to approach teacher-level behavior, while RL is gradually strengthened to drive reward-based exploration.

Paper Abstract

Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement. On-policy distillation (OPD) provides dense teacher guidance and typically improves rapidly in the early stage, but its gains saturate once the student approaches the teacher, limiting the final performance ceiling. Reinforcement learning (RL) directly optimizes environment rewards and encourages exploratory improvement toward a higher reward-defined ceiling, but sparse and delayed feedback makes early-stage learning much less efficient than OPD. In this paper, we propose ATOD (Annealed Turn-aware On-policy Distillation), a hybrid online distillation algorithm that explicitly exploits this complementarity. (1) ATOD uses an annealed OPD-RL schedule: OPD dominates early training to approach teacher-level behavior, while RL is gradually strengthened to drive reward-based exploration. (2) ATOD introduces Turn-level Disagreement-Uncertainty Reweighting (T-DUR), which softly amplifies high-utility turns and improves dense supervision in long trajectories. Experiments on ALFWorld, WebShop, and Search-QA show that ATOD consistently outperforms competing post-training baselines: across the three student sizes, ATOD improves average success rate by 3.03 points over OPD and 23.62 points over GRPO, while surpassing the corresponding teacher models by 2.16 points.

ATOD: Annealed Turn-aware On-policy Distillation for Multi-turn Autonomous Agents
Training small language models to act as effective agents—capable of navigating websites, solving complex search queries, or interacting with household environments—is a difficult balancing act. Developers typically choose between two methods: On-policy Distillation (OPD), which mimics a "teacher" model for fast early progress, or Reinforcement Learning (RL), which uses environment rewards to push the model to improve beyond its teacher. ATOD (Annealed Turn-aware On-policy Distillation) is a new hybrid approach that combines these two strategies to achieve faster learning and higher performance ceilings than either method could reach alone.

The Hybrid Training Schedule

The core of ATOD is an "annealed" schedule that shifts the model's learning priorities over time. Early in training, the model relies heavily on dense teacher guidance (OPD) to learn the basics of interacting with an environment, which helps avoid the slow "cold start" that RL suffers under sparse, delayed rewards. As training progresses, the schedule gradually reduces the teacher's influence and increases the weight of environment rewards (RL). This transition lets the model first approach teacher-level behavior and then explore new strategies that can surpass the teacher's performance.
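The schedule described above can be pictured with a small sketch. The summary does not give the actual annealing formula, so everything here is an assumption for illustration: the cosine shape, the convex combination of the two losses, and the names `opd_weight` and `combined_loss` are hypothetical and not taken from the paper.

```python
import math


def opd_weight(step: int, total_steps: int, floor: float = 0.0) -> float:
    """Weight on the distillation (OPD) term at a given training step.

    Assumed cosine decay from 1.0 at step 0 down to `floor` at the last step.
    The real ATOD schedule may have a different shape.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))


def combined_loss(opd_loss: float, rl_loss: float, step: int, total_steps: int) -> float:
    """Blend the two objectives: OPD dominates early, RL dominates late."""
    lam = opd_weight(step, total_steps)
    return lam * opd_loss + (1.0 - lam) * rl_loss


if __name__ == "__main__":
    # Show how the balance shifts over a hypothetical 1000-step run.
    for step in (0, 250, 500, 750, 1000):
        print(step, round(opd_weight(step, 1000), 3))
```

Under these assumptions the teacher term carries all of the weight at the first step and none at the last, which mirrors the "OPD first, RL later" behavior the section describes. A nonzero `floor` would keep some teacher guidance throughout training.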

Focusing on High-Value Decisions

In long, multi-turn tasks, not every action is equally important. Some steps are routine, while others are critical decision points that determine success or failure. ATOD introduces a mechanism called Turn-level Disagreement-Uncertainty Reweighting (T-DUR). Instead of treating every part of a conversation or task equally, T-DUR identifies which turns are most informative by measuring how much the student model disagrees with the teacher and how uncertain the student is about its own actions. By focusing supervision on these high-utility turns, the model learns more efficiently and avoids wasting training time on simple, repetitive actions.
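A rough sketch of what turn-level reweighting could look like follows. The summary only says that T-DUR combines student-teacher disagreement with student uncertainty and amplifies high-utility turns "softly", so the concrete choices here are assumptions: disagreement is measured as KL(student || teacher), uncertainty as student entropy, the two are summed, and a temperature softmax turns the scores into weights that average to 1. The function names are hypothetical, and each turn is reduced to a single action distribution for brevity.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def entropy(p: Sequence[float], eps: float = 1e-12) -> float:
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p if pi > 0)


def turn_weights(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    temperature: float = 1.0,
) -> List[float]:
    """Soft per-turn weights; higher for turns with more disagreement and uncertainty.

    Assumed scoring rule: score = KL(student || teacher) + H(student).
    Weights are a softmax over scores, rescaled so they average to 1.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must cover the same turns")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not student:
        return []
    scores = [kl_divergence(s, t) + entropy(s) for s, t in zip(student, teacher)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((sc - top) / temperature) for sc in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


if __name__ == "__main__":
    # Turn 0: student agrees with the teacher and is confident (routine step).
    # Turn 1: student is unsure and disagrees with the teacher (decision point).
    student = [[0.9, 0.1], [0.5, 0.5]]
    teacher = [[0.9, 0.1], [0.1, 0.9]]
    print(turn_weights(student, teacher))
```

In this toy example the second turn receives the larger weight, so its distillation loss would count for more. Because the weights come from a softmax rather than a hard threshold, routine turns are down-weighted but never dropped entirely, and a larger `temperature` moves the weights closer to uniform.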

Performance and Results

ATOD was tested on three benchmarks: ALFWorld (household tasks), WebShop (web-based shopping), and Search-QA (complex question answering). Averaged across the three student sizes, ATOD improved success rate by 3.03 points over OPD and 23.62 points over GRPO, and it surpassed the corresponding teacher models by 2.16 points. The results suggest that combining a dynamic training schedule with turn-aware supervision is particularly effective for smaller models, helping them overcome the limitations of sparse feedback and reach higher success rates in complex, multi-step environments.
