The Role of Feedback Alignment in Self-Distillation explores how to improve language models by refining the "context" they receive during training. Self-distillation is a technique where a model acts as both a student (which sees only a question) and a teacher (which sees the question plus extra information). By training the student to match the teacher’s output, the model learns to solve problems more effectively even when that extra context is removed. This paper investigates how the design of that context—specifically the feedback provided to the teacher—determines how much the model actually learns.
Improving Context Design
The researchers compared three ways to provide feedback to the model: a simple binary reward (did the model get the answer right?), a full reference solution (the "correct" way to solve the problem), and a step-by-step critique aligned to the model's own reasoning. They found that the structure of this feedback is critical. While a reference solution provides the correct answer, it often uses different phrasing or logic than the model’s own attempt, which can confuse the training process by penalizing the model for stylistic differences rather than actual errors.
The Power of Step-Aligned Feedback
The study demonstrates that "step-aligned" feedback—where the critic corrects only the specific steps where the model goes wrong while keeping the correct steps intact—is significantly more effective. This method acts as a form of "process supervision." By targeting only the errors, the model receives a clear signal on where to improve without being pressured to change its entire approach. This approach outperformed the binary reward method by 16.11 points and the reference solution method by 5.27 points on key accuracy metrics.
Why Alignment Matters
The effectiveness of this feedback relies on a phenomenon related to how models process information in context. If the feedback includes too much of the model's original incorrect reasoning, the model may simply "copy" its own mistakes. Conversely, if the feedback ignores the model's correct steps, the model may lose its ability to perform those steps correctly. The researchers discovered that the best results occur when the feedback repeats the model's correct reasoning verbatim and only provides corrections at the exact point of failure. This allows the model to maintain its successful reasoning patterns while learning to fix specific logical gaps.
Key Takeaways
The findings suggest that the quality of feedback is not just about being "correct," but about being structurally aligned with the model's own reasoning trace. By using step-aligned critiques, researchers can achieve the benefits of process-level supervision without the need for expensive, specialized reward models. This research highlights that how we present information to a model during training is just as important as the information itself, providing a more efficient path to improving reasoning capabilities in language models.
Comments (0)
to join the discussion
No comments yet
Be the first to share your thoughts!