Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
This paper addresses a failure mode of On-Policy Self-Distillation (OPSD), a popular technique in which a student model learns to reason from a "privileged" teacher that knows the correct answer. When applied to models designed for long, reflective chains of thought, OPSD often backfires: instead of becoming better thinkers, these models lose their ability to reason independently, and performance stagnates or declines. The authors trace this to the teacher pushing the student to memorize the specific path to the answer rather than teaching it how to solve problems in general.
The Problem: Rote Memorization
The researchers found that when a teacher model has access to a reference solution, its feedback is dominated by "reference-induced" information. In effect, the teacher tells the student, "Here is the exact sequence of words to reach this specific answer," rather than offering guidance on the reasoning process itself. As a result, the student abandons its natural, reflective thinking style, often marked by tokens like "Wait" or "Let me think," in favor of copying the reference. The team’s analysis shows that this reference-induced signal is strong enough to oppose the transferable reasoning corrections the student needs to learn.
The Solution: Purifying the Signal
To fix this, the authors propose a two-step "purification" process. First, they create a "reference-only" teacher: a model that sees the answer but not the original question. By comparing its predictions with those of the standard teacher, they isolate and subtract the non-transferable, memorization-heavy signal. What remains is a "residual" that contains only the question-conditioned, transferable reasoning corrections.
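The subtraction step can be illustrated with a minimal sketch. The summary does not give the paper's exact formulation, so the form used here (a per-token difference of log-probabilities between the standard teacher and the reference-only teacher) is an assumption, and the function names, the toy vocabulary, and the logit values are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def purified_residual(teacher_logits, ref_only_logits):
    """Assumed residual: log p(token | question, reference) - log p(token | reference).

    teacher_logits:  next-token logits from the standard (privileged) teacher.
    ref_only_logits: next-token logits from the reference-only teacher.
    """
    lp_full = log_softmax(teacher_logits)
    lp_ref = log_softmax(ref_only_logits)
    return [a - b for a, b in zip(lp_full, lp_ref)]

# Hypothetical 3-token vocabulary: ["Wait", "So", "42"].
# Both teachers strongly favor "42" (copied from the reference);
# only the standard teacher, which also sees the question, boosts "Wait".
teacher_logits = [1.0, 0.5, 3.0]
ref_only_logits = [0.0, 0.5, 3.0]

residual = purified_residual(teacher_logits, ref_only_logits)
```

In this toy case the residual is positive only for "Wait": the preference for "42" is shared by both teachers and cancels out, which is the intuition behind removing the reference-induced part of the signal.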
Second, they use a mathematical tool called Pointwise Mutual Information (PMI) to turn this residual into a clean target distribution. This allows the student model to learn only the useful, generalizable reasoning steps while remaining anchored to its original, healthy base model.
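A minimal sketch of how a PMI-style residual could be turned into a target distribution anchored to the base model follows. The summary does not state the paper's formula, so this is an assumption: the PMI of a token with the question, given the reference, is taken as log p(token | question, reference) - log p(token | reference), and the target is the base model's distribution reweighted by exp(beta * PMI) and renormalized. The function name `pmi_target`, the scaling factor `beta`, and all probabilities are hypothetical.

```python
import math

def pmi_target(base_probs, teacher_probs, ref_only_probs, beta=1.0):
    """Assumed target: proportional to base_prob * exp(beta * PMI), renormalized.

    base_probs:     next-token probabilities of the original base model.
    teacher_probs:  probabilities from the standard teacher (question + reference).
    ref_only_probs: probabilities from the reference-only teacher.
    beta:           hypothetical strength of the correction (0 recovers the base model).
    """
    scores = []
    for b, t, r in zip(base_probs, teacher_probs, ref_only_probs):
        pmi = math.log(t) - math.log(r)  # pointwise mutual information term
        scores.append(math.log(b) + beta * pmi)
    # Softmax over the scores to obtain a proper distribution.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-token vocabulary: ["Wait", "So", "42"].
base_probs = [0.5, 0.3, 0.2]
teacher_probs = [0.1, 0.1, 0.8]
ref_only_probs = [0.05, 0.15, 0.8]

target = pmi_target(base_probs, teacher_probs, ref_only_probs)
unchanged = pmi_target(base_probs, teacher_probs, ref_only_probs, beta=0.0)
```

Under these assumptions the target shifts probability toward "Wait", the token the question makes more likely, while "42" gains nothing from being favored by both teachers. Setting `beta` to zero returns the base distribution, which is one way to read "anchored to the base model".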
Results and Impact
The researchers tested this approach across four different long-chain-of-thought models and two datasets. Their experiments show that the "Purified OPSD" method consistently outperforms both the base models and standard OPSD. Most importantly, the new method preserves the models' natural epistemic behavior: the reflective, self-correcting "thinking" process that is essential for complex problem-solving. By filtering out the noise of rote memorization, the student models improve their reasoning capabilities without losing the traits that make them effective thinkers.