The Role of Feedback Alignment in Self-Distillation

Key Takeaways

  • The Role of Feedback Alignment in Self-Distillation explores how to improve language models by refining the "context" they receive during training.
  • Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response.
  • Self-distillation trains the model to retain this improvement when the context is not present.
  • The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context.
  • What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored.

Paper Abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.

The Role of Feedback Alignment in Self-Distillation explores how to improve language models by refining the "context" they receive during training. In self-distillation, one model plays two roles: a student that sees only the question, and a self-teacher that sees the question plus extra information. Training the student to match the self-teacher's output distribution lets the model keep the benefit of that extra context even after it is removed. This paper investigates how the design of that context, specifically the feedback provided to the self-teacher, determines how much the model actually learns.
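
The matching objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the loss is a per-token KL divergence from the self-teacher to the student, and the function names and toy logits are invented for the example. The summary does not state which divergence or direction the authors use.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL from the self-teacher to the student.

    Both arguments hold one logit vector per response token.
    Teacher logits: the model given question + context.
    Student logits: the same model given the question only.
    ASSUMPTION: KL(teacher || student); the paper may use another divergence.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy 3-word vocabulary, 2 response tokens (made-up numbers).
teacher = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]]
student = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.1]]
loss = self_distillation_loss(teacher, student)
```

In this toy case the two roles agree on the first token and disagree on the second, so all of the loss comes from the second position. In practice the teacher side would typically be treated as a fixed target while gradients flow only through the student.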

Improving Context Design

The researchers compared three training conditions: a binary reward that only says whether the final answer was right (GRPO), self-distillation conditioned on a full reference solution, and self-distillation conditioned on a step-by-step critique aligned to the model's own reasoning trace. They found that the structure of the feedback is critical. A reference solution does contain the correct answer, but as an alternative derivation it inevitably differs from the model's own attempt in phrasing and approach. Matching it therefore pressures the model to change its behavior at every token, including on steps that were already correct, rather than only where the reasoning failed.
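
To make the two context-based conditions concrete, here is a small sketch of how the student and self-teacher prompts might differ. The prompt template, the condition names, and the example problem are all hypothetical; the summary does not give the paper's actual prompt format. The binary-reward condition is omitted because it supplies a scalar reward rather than a context.

```python
def build_student_prompt(question: str) -> str:
    """The student sees only the question."""
    return f"Question: {question}\nSolution:"

def build_teacher_prompt(question: str, condition: str, context: str) -> str:
    """The self-teacher sees the question plus the chosen context.

    condition is "reference" or "critique" (hypothetical labels).
    """
    labels = {
        "reference": "Reference solution",
        "critique": "Step-by-step critique of the previous attempt",
    }
    if condition not in labels:
        raise ValueError(f"unknown condition: {condition!r}")
    return f"Question: {question}\n{labels[condition]}:\n{context}\nSolution:"

# Hypothetical example problem and feedback.
question = "Solve 2x + 6 = 10."
reference = "Subtract 6 from both sides, then halve: x = 2."
critique = (
    "Step 1: 2x = 10 - 6 = 4 (correct)\n"
    "Step 2 should divide by 2, not multiply: x = 2"
)

student_prompt = build_student_prompt(question)
reference_prompt = build_teacher_prompt(question, "reference", reference)
critique_prompt = build_teacher_prompt(question, "critique", critique)
```

Both teacher prompts begin with the same question as the student prompt; only the inserted context changes, which is the variable the study isolates.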

The Power of Step-Aligned Feedback

The study shows that "step-aligned" feedback, in which the critic's corrections target only the steps where the model goes wrong and leave correct steps intact, is significantly more effective. It acts as a form of "process supervision": because only the errors are targeted, the model gets a clear signal about where to improve without being pressured to change its entire approach. On Avg@12, step-aligned critique outperformed the binary-reward baseline (GRPO) by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points. Taken together, those two gaps imply that reference-solution conditioning itself beats GRPO by 10.84 points.
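
The per-token picture behind this result can be illustrated with a toy calculation. The sketch below assumes that a token's "advantage" is the log-ratio between the self-teacher's and the student's probability for the token the student actually produced; that definition, the threshold, and every probability are invented for illustration and are not taken from the paper.

```python
import math

def per_token_advantage(teacher_probs, student_probs):
    """Log-ratio of teacher to student probability for each sampled token.

    ASSUMPTION: this log-ratio stands in for the paper's per-token advantage.
    Near zero means "keep doing this"; strongly negative means "change this".
    """
    return [math.log(t) - math.log(s) for t, s in zip(teacher_probs, student_probs)]

def affected_fraction(advantages, threshold=0.1):
    """Share of tokens whose advantage magnitude exceeds the threshold."""
    hits = sum(1 for a in advantages if abs(a) > threshold)
    return hits / len(advantages)

# Probabilities assigned to the student's own six tokens (made-up numbers).
# The token at index 3 is where the reasoning goes wrong.
student = [0.90, 0.85, 0.80, 0.70, 0.88, 0.90]
# Step-aligned critique: teacher agrees everywhere except the faulty token.
critique_teacher = [0.90, 0.85, 0.80, 0.05, 0.88, 0.90]
# Reference solution: teacher prefers different wording at every position.
reference_teacher = [0.60, 0.50, 0.45, 0.05, 0.55, 0.60]

critique_adv = per_token_advantage(critique_teacher, student)
reference_adv = per_token_advantage(reference_teacher, student)

critique_share = affected_fraction(critique_adv)
reference_share = affected_fraction(reference_adv)
```

With these numbers the critique-conditioned teacher pushes on a single token, while the reference-conditioned teacher pushes on all six, which mirrors the contrast between targeted and diffuse training pressure.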

Why Alignment Matters

The effectiveness of this feedback depends on how the self-teacher uses what is placed in its context. If the feedback includes too much of the model's original incorrect reasoning, the model may simply "copy" its own mistakes. Conversely, if the feedback ignores the model's correct steps, the model may lose its ability to perform those steps correctly. The researchers report that the best results occur when the feedback repeats the model's correct reasoning verbatim and provides a correction only at the exact point of failure. This lets the model keep its successful reasoning patterns while learning to fix specific logical gaps.
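
As a rough sketch of the feedback structure described above, the function below assembles a critique that echoes the correct steps verbatim and substitutes a correction at the first faulty step. The function name, the output format, and the choice to drop everything after the first error are assumptions made for this example; in the paper the critique is produced by a frozen critic model, not by a template.

```python
def step_aligned_critique(steps, first_error, correction):
    """Echo correct steps verbatim, then correct the first faulty step.

    steps: the solver's reasoning steps, in order.
    first_error: index of the first incorrect step, or None if all are correct.
    correction: text that replaces the faulty step (unused if first_error is None).
    The faulty step itself is not quoted, so the self-teacher has nothing
    incorrect to copy. Steps after the error are dropped (an assumption).
    """
    if first_error is None:
        return list(steps)
    if not 0 <= first_error < len(steps):
        raise IndexError("first_error is outside the range of steps")
    kept = list(steps[:first_error])  # correct reasoning, repeated verbatim
    kept.append(f"Correction for step {first_error + 1}: {correction}")
    return kept

# Hypothetical solver trace for "Solve 2x + 6 = 10."
trace = [
    "Subtract 6 from both sides: 2x = 4.",
    "Multiply both sides by 2: x = 8.",
    "So the answer is 8.",
]
critique = step_aligned_critique(trace, 1, "divide both sides by 2, so x = 2.")
```

Here the first line of the critique is identical to the solver's first step, and the faulty multiplication never appears in the feedback.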

Key Takeaways

The findings suggest that the quality of feedback is not just about being "correct," but about being structurally aligned with the model's own reasoning trace. With step-aligned critiques, researchers can get the benefits of process-level supervision without expensive, specialized reward models. How information is presented to a model during training can matter as much as the information itself, and aligning feedback to the model's reasoning offers a more efficient path to improving reasoning capabilities in language models.
