Purified OPSD: On-Policy Self-Distillation Without Losing How to Think

Key Takeaways

  • On-policy self-distillation (OPSD) consistently fails on long chain-of-thought (long-CoT) reasoning models: it yields at best marginal gains while destabilizing the reflective reasoning capability these models depend on.
  • The root cause is that the teacher's supervision is dominated by a reference-induced component that drives rote memorization of reference-specific shortcuts, while the question-conditioned, inference-transferable component is ignored or actively opposed.
  • The proposed fix has two steps: subtract the signal of a reference-only teacher to isolate the question-conditioned residual, then use pointwise mutual information (PMI) to turn that residual into a target distribution the student can distill from.
  • Experiments on four long-CoT models across two datasets demonstrate consistent improvements over both the base model and standard OPSD, while preserving the models' natural epistemic behavior throughout training.
Paper Abstract

On-policy self-distillation (OPSD) has emerged as a promising paradigm for improving LLM reasoning, where a privileged teacher with access to reference solutions provides token-level supervision on the student's own generated trajectories. However, we find that OPSD consistently fails on long chain-of-thought (long-CoT) reasoning models, yielding at best marginal gains while destabilizing the reflective reasoning capability these models depend on. Through a novel decomposition of the teacher's supervision signal, we identify the root cause: the teacher's supervision is dominated by a reference-induced component that drives rote memorization of reference-specific shortcuts, while the question-conditioned, inference-transferable component is ignored or actively opposed. Based on this diagnosis, we propose a two-step solution. First, we construct a reference-only teacher (the same model conditioned on the reference without the question) to isolate the non-transferable component of the supervision signal; the residual after subtracting this component captures the question-conditioned, inference-transferable correction. Second, we use pointwise mutual information (PMI) as the mechanism to transform this residual into a well-formed PMI target distribution that the student can directly distill from, filtering out the reference-induced shortcut. Experiments on four long-CoT models across two datasets demonstrate consistent improvements over both the base model and standard OPSD, while preserving the models' natural epistemic behavior throughout training.

Purified OPSD: On-Policy Self-Distillation Without Losing How to Think
This paper addresses a failure mode in a current LLM training method. On-policy self-distillation (OPSD) aims to improve a model's reasoning by having it learn, on its own generated trajectories, from a "privileged" teacher that can see a reference solution. The authors find that it backfires on models designed for long, reflective chains of thought: gains are marginal at best, and the reflective reasoning these models depend on is destabilized. They trace this to the teacher's signal, which pushes the student to memorize reference-specific shortcuts to the answer rather than teaching it how to solve problems in general.

The Problem: Rote Memorization

The researchers found that when the teacher has access to a reference solution, its token-level feedback is dominated by "reference-induced" information. The teacher is essentially telling the student, "Here is the exact sequence of words to reach this specific answer," rather than offering guidance on the reasoning process itself. As a result, the student abandons its natural, reflective thinking style, often marked by tokens like "Wait" or "Let me think", in favor of copying the reference. The team’s decomposition of the supervision signal shows that the question-conditioned, transferable component, which is what the student actually needs to learn, is ignored or actively opposed.

The Solution: Purifying the Signal

To fix this, the authors propose a two-step "purification" process. First, they construct a "reference-only" teacher: the same model, conditioned on the reference solution but not on the question. Subtracting this teacher's signal from the standard teacher's isolates and removes the non-transferable, memorization-heavy component. What remains is a "residual" that captures the question-conditioned, transferable reasoning correction.
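
One way to write this subtraction down is as an identity on token log-probabilities. The notation here is an assumption introduced for illustration and does not come from the paper: \pi is the model, q the question, r the reference solution, y_t the next token, and y_{<t} the student's own prefix. The paper's exact decomposition may differ.

```latex
\log \pi(y_t \mid q, r, y_{<t})
  = \underbrace{\log \pi(y_t \mid r, y_{<t})}_{\text{reference-only teacher}}
  + \underbrace{\log \frac{\pi(y_t \mid q, r, y_{<t})}{\pi(y_t \mid r, y_{<t})}}_{\text{question-conditioned residual}}
```

Under this reading, the second term is the conditional pointwise mutual information between the token y_t and the question q, given the reference and the prefix.
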
Second, they use a mathematical tool called Pointwise Mutual Information (PMI) to turn this residual into a clean target distribution. This allows the student model to learn only the useful, generalizable reasoning steps while remaining anchored to its original, healthy base model.
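
The two steps above can be sketched for a single token position as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `beta` scale, and the choice to tilt an anchor distribution (for example, the student's or base model's own prediction) by the residual are all hypothetical. The source says only that PMI turns the residual into a well-formed target distribution.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def pmi_target(anchor_logits, teacher_logits, ref_only_logits, beta=1.0):
    """Hypothetical per-token target distribution.

    anchor_logits:   model given the question only (assumed anchor)
    teacher_logits:  model given question + reference (standard teacher)
    ref_only_logits: model given the reference only (reference-only teacher)
    beta:            assumed scale on the residual (not from the source)
    """
    log_a = log_softmax(anchor_logits)
    log_t = log_softmax(teacher_logits)
    log_r = log_softmax(ref_only_logits)
    # Step 1: subtract the reference-only signal; what is left is the
    # question-conditioned residual (a conditional PMI per token).
    residual = [t - r for t, r in zip(log_t, log_r)]
    # Step 2: tilt the anchor by the residual and renormalize so the
    # result is a proper probability distribution.
    tilted = [a + beta * d for a, d in zip(log_a, residual)]
    return [math.exp(x) for x in log_softmax(tilted)]


# Toy vocabulary of three tokens. Token 0 is favored by both teachers
# (a reference-induced shortcut); token 1 is favored only when the
# question is also visible.
target = pmi_target(
    anchor_logits=[0.0, 0.0, 0.0],
    teacher_logits=[2.0, 1.0, 0.0],
    ref_only_logits=[2.0, 0.0, 0.0],
)
print(target)
```

In this toy case the boost that both teachers give to token 0 cancels in the residual, so the target favors token 1. When the two teachers agree exactly, the residual is constant across tokens and the target reduces to the anchor distribution.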

Results and Impact

The researchers tested this approach across four different long-chain-of-thought models and two datasets. Their experiments demonstrate that this "Purified OPSD" method consistently outperforms both the base models and standard OPSD. Most importantly, the new method preserves the models' natural epistemic behavior—the reflective, self-correcting "thinking" process that is essential for complex problem-solving. By filtering out the noise of rote memorization, the student models are able to improve their reasoning capabilities without losing the very traits that make them effective thinkers.
