DOPD (Dual On-policy Distillation) is a method for training smaller AI models by transferring knowledge from larger, more capable "teacher" models. Traditional distillation methods often struggle when the teacher is given "privileged information", meaning extra data such as reasoning hints or visual annotations. The paper identifies a specific failure mode it calls "privilege illusion": the student model learns to mimic the teacher's reliance on the extra information rather than acquiring the underlying reasoning skills. DOPD addresses this by dynamically routing supervision so that the student learns genuine capabilities instead of shortcuts.
The Problem: Privilege Illusion
When researchers provide extra, privileged information to a teacher model to improve its performance, they often hope the student will learn from that improved output. However, the authors found that this creates a "privilege illusion." The student's training signal mixes two different gaps: the actual difference in skill between the teacher and student, and the "information asymmetry" caused by the extra data. Because the student cannot access the same privileged context in the same way, it often fails to learn the core task, leading to unstable training and poor performance.
How DOPD Works
DOPD uses an "advantage-aware" approach to manage how the student learns. Instead of treating every token (the individual units of text or data the model generates) as equally important, the system calculates a "privilege advantage gap." This metric measures the difference in confidence between the teacher and the student when both are given the same privileged information.
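One way to make the "privilege advantage gap" concrete is as a per-token difference in log-probability. The sketch below is only an illustration under that assumption; the function name and the choice of log-probabilities as the measure of "confidence" are hypothetical and not taken from the paper.

```python
import math


def privilege_advantage_gap(teacher_probs, student_probs):
    """Per-token gap between teacher and student confidence.

    Both lists hold the probability each model assigns to the same
    generated token, with BOTH models conditioned on the same
    privileged information. Using a log-probability difference is an
    assumption made for this sketch, not the paper's stated formula.
    """
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must score the same tokens")
    return [math.log(t) - math.log(s) for t, s in zip(teacher_probs, student_probs)]


# Toy probabilities for three generated tokens (made-up numbers).
gaps = privilege_advantage_gap([0.9, 0.5, 0.8], [0.3, 0.5, 0.7])
for position, gap in enumerate(gaps):
    print(f"token {position}: gap = {gap:.3f}")
```

In this toy example the first token has the largest gap, because the teacher is much more confident than the student even though both see the same privileged context.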
If the gap is large, it suggests the teacher has a genuine skill advantage, so the system applies stronger supervision to help the student learn that skill. If the gap is small, it suggests the teacher's advantage is likely just due to the extra information, so the system uses lighter supervision to keep the training stable and encourage the student to explore on its own.
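The routing rule described above can be sketched as a per-token weight on the distillation loss. This is a minimal illustration, not the paper's implementation: the hard threshold, the two weight values, and the function names are all hypothetical, and the actual method may use a smoother or differently defined rule.

```python
def route_supervision(gaps, threshold=0.5, strong_weight=1.0, light_weight=0.1):
    """Pick a supervision weight for each token from its gap.

    A gap above `threshold` is treated as a genuine skill advantage and
    gets the strong weight; anything else gets the light weight.
    The threshold and weights are assumed values for illustration.
    """
    return [strong_weight if gap > threshold else light_weight for gap in gaps]


def weighted_distillation_loss(per_token_losses, weights):
    """Average the per-token losses after scaling each by its weight."""
    if len(per_token_losses) != len(weights):
        raise ValueError("need exactly one weight per token loss")
    total = sum(w * loss for w, loss in zip(weights, per_token_losses))
    return total / len(per_token_losses)


# Toy gaps and per-token distillation losses (made-up numbers).
example_gaps = [1.2, 0.1, 0.7]
example_losses = [2.0, 1.0, 0.5]

weights = route_supervision(example_gaps)
loss = weighted_distillation_loss(example_losses, weights)
print("weights:", weights)
print("weighted loss:", round(loss, 4))
```

Under this scheme the second token, whose gap is small, contributes little to the loss, which corresponds to the lighter supervision that leaves the student room to explore.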
Results and Performance
The researchers tested DOPD on both Large Language Models (LLMs) and Vision-Language Models (VLMs). The results showed that DOPD consistently outperformed standard on-policy distillation methods. On average, it improved performance by 7.5 points on LLM benchmarks and 6.0 points on VLM benchmarks. Beyond raw accuracy, DOPD also showed more stable training, improved robustness, and better performance in continual learning tasks, which suggests that it transfers genuine capabilities more effectively than previous methods.
Key Takeaways
The core insight of this research is that simply adding more data or "privileged" context to a teacher model is not enough to guarantee a better student model. Without a way to distinguish between a teacher's actual competence and its reliance on extra information, students will likely fall into the trap of learning "shortcuts." By dynamically adjusting the supervision strategy based on the privilege advantage gap, DOPD allows for a more efficient and reliable way to distill complex AI models into smaller, more practical versions.