DOPD: Dual On-policy Distillation

Key Takeaways

DOPD (Dual On-policy Distillation) is a distillation method for training smaller "student" models from larger, more capable "teacher" models.
On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals.
A natural way to raise the quality of supervision, and with it the performance ceiling of distillation, is to give privileged information to either the teacher or the student itself.
Privileged inputs can induce a failure mode the authors call "privilege illusion," which is amplified by the non-uniformity of token-level supervision: only a small subset of tokens carries pivotal capability-bearing signals.
Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts.

Paper Abstract

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals, to alleviate privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.

DOPD (Dual On-policy Distillation) is a method for training smaller AI models by transferring knowledge from larger, more capable "teacher" models. Distillation methods often struggle when they use "privileged information" (extra inputs such as reasoning hints or visual annotations), and this paper identifies a specific failure mode it calls "privilege illusion." This occurs when a student model learns to mimic the teacher's reliance on extra information rather than acquiring the underlying reasoning skills. DOPD addresses this by dynamically routing token-level supervision so that the student learns genuine capabilities instead of shortcuts.

The Problem: Privilege Illusion

When researchers provide extra, privileged information to a teacher model to improve its performance, they hope the student will learn from that improved output. However, the authors found that this creates a "privilege illusion." The supervision signal conflates two different gaps: the actual difference in skill between the teacher and student, which the student is meant to close, and the "information asymmetry" caused by the extra data, which the student can mimic but never replicate. Because the student cannot access the same privileged context in the same way, it often fails to learn the core task, leading to unstable training and poor performance.
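
One way to make the two gaps concrete, offered as an illustrative sketch rather than the paper's own formulation: write x for the prompt, c for the privileged information, y_t for a student-sampled token, π_T for the teacher, and π_S for the student (all of this notation is assumed here, not taken from the paper). The total log-probability gap between a privileged teacher and an unprivileged student then splits by a simple telescoping identity:

```latex
\underbrace{\log \pi_T(y_t \mid x, c, y_{<t}) - \log \pi_S(y_t \mid x, y_{<t})}_{\text{total gap}}
=
\underbrace{\log \pi_T(y_t \mid x, c, y_{<t}) - \log \pi_S(y_t \mid x, c, y_{<t})}_{\text{capability gap (same information)}}
+
\underbrace{\log \pi_S(y_t \mid x, c, y_{<t}) - \log \pi_S(y_t \mid x, y_{<t})}_{\text{information asymmetry gap}}
```

Under this reading, only the first term reflects skill the student could acquire. The second term exists purely because one side sees c, so a student trained to close the total gap may be chasing a difference it cannot reproduce without the privileged input.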

How DOPD Works

DOPD uses an "advantage-aware" approach to manage how the student learns. Instead of treating every token (the individual units of text or data the model generates) as equally important, the system calculates a "privilege advantage gap." This metric measures the difference in confidence between the teacher and the student when both are given the same privileged information.
If the gap is large, it suggests the teacher has a genuine skill advantage, so the system applies stronger supervision to help the student learn that skill. If the gap is small, it suggests the teacher's advantage is likely just due to the extra information, so the system uses lighter supervision to keep the training stable and encourage the student to explore on its own.
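
The routing rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `route_tokens`, the hard threshold, the two fixed weights, and the use of a log-probability difference as the "gap" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenSupervision:
    source: str    # "teacher" or "student_self" (hypothetical labels)
    weight: float  # supervision strength applied to this token
    gap: float     # teacher-minus-student log-probability, both privileged


def route_tokens(
    teacher_logprobs: List[float],
    student_priv_logprobs: List[float],
    threshold: float = 0.5,      # assumed cutoff, not from the paper
    strong_weight: float = 1.0,  # assumed weight for large-gap tokens
    light_weight: float = 0.1,   # assumed weight for small-gap tokens
) -> List[TokenSupervision]:
    """Assign a supervision source and strength to each sampled token.

    Both inputs are log-probabilities of the same student-sampled tokens,
    with teacher and student each conditioned on the same privileged input.
    """
    if len(teacher_logprobs) != len(student_priv_logprobs):
        raise ValueError("log-probability sequences must have equal length")

    routed = []
    for t_lp, s_lp in zip(teacher_logprobs, student_priv_logprobs):
        gap = t_lp - s_lp
        if gap > threshold:
            # Large gap: treat as a genuine skill advantage, supervise strongly.
            routed.append(TokenSupervision("teacher", strong_weight, gap))
        else:
            # Small gap: the advantage is likely informational, supervise lightly.
            routed.append(TokenSupervision("student_self", light_weight, gap))
    return routed


if __name__ == "__main__":
    for item in route_tokens([-0.1, -1.0], [-2.0, -1.1]):
        print(item)
```

In the example call, the first token has a large gap and is routed to strong teacher supervision, while the second has a small gap and receives the lighter signal. The paper describes routing based on both the advantage gap and relative probabilities, with differing objectives and strategies per token, so a real implementation would be richer than this single threshold.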

Results and Performance

The researchers tested DOPD on both large language models (LLMs) and vision-language models (VLMs). DOPD consistently outperformed standard on-policy distillation, improving average performance by 7.5 points on LLM benchmarks and 6.0 points on VLM benchmarks. Beyond raw accuracy, the method showed better training stability, improved robustness, and stronger results on continual learning and out-of-distribution tasks, which the authors present as evidence that it transfers genuine capability more effectively than previous methods.

Key Takeaways

The core insight of this research is that simply adding more data or "privileged" context to a teacher model is not enough to guarantee a better student model. Without a way to distinguish between a teacher's actual competence and its reliance on extra information, students will likely fall into the trap of learning "shortcuts." By dynamically adjusting the supervision strategy based on the privilege advantage gap, DOPD allows for a more efficient and reliable way to distill complex AI models into smaller, more practical versions.
