Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
This paper introduces Rubric-Conditioned Self-Distillation (RCSD), a framework for training reasoning language models. Such models are typically trained either with reinforcement learning on simple "pass/fail" rewards or by imitating a single "gold standard" reasoning path. The authors argue that both approaches are limited: a scalar reward cannot explain why a response failed, and imitating a single path is overly restrictive. RCSD instead uses structured rubrics (sets of specific criteria) to provide dense, token-level guidance, so the model learns the underlying principles of a high-quality response rather than memorizing a single answer.
A New Way to Guide Learning
The core innovation of RCSD is that it does not compress feedback into a single number or a single reference trajectory. Instead, the framework treats rubrics as a "privileged" source of information. During training, a teacher model is given a rubric that outlines what a strong response should look like. The teacher then provides token-level feedback to a student model on the reasoning the student itself generates. The student thus receives criterion-aware guidance that shows where in its response it is succeeding or failing.
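One way to picture token-level feedback is as a per-position divergence between the teacher's next-token distribution (computed with the rubric in context) and the student's (computed without it). The sketch below is illustrative only: the function names, the toy logits, and the choice of KL(teacher || student) as the divergence are assumptions, not details confirmed by the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_kl(student_logits, teacher_logits):
    """Return KL(teacher || student) for each response position.

    Both arguments are lists of per-position logit vectors over the same
    vocabulary. The divergence direction is an assumption for illustration.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_student = softmax(s_row)
        p_teacher = softmax(t_row)
        kl = sum(
            t * math.log(t / s)
            for t, s in zip(p_teacher, p_student)
            if t > 0
        )
        losses.append(kl)
    return losses


# Toy example: a 3-token vocabulary and a 2-token response.
# Position 0: teacher and student agree. Position 1: they disagree.
student_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher_logits = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]

per_token_loss = token_level_kl(student_logits, teacher_logits)
print(per_token_loss)  # zero at position 0, positive at position 1
```

Unlike a single scalar reward, the resulting list has one entry per generated token, so the training signal is concentrated on the positions where the rubric-informed teacher disagrees with the student.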
The Two-Stage Pipeline
To make this approach practical, the authors divide the training process into two distinct stages. In the first stage, they train a "rubric generator" that learns to create task-specific evaluation criteria for any given question. This allows the model to generate its own rubrics without needing human-written ones at test time. In the second stage, the reasoner uses these generated rubrics to guide its own learning. By separating the creation of criteria from the reasoning process, the model becomes more flexible and capable of handling complex tasks where there isn't just one "correct" way to think through a problem.
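The data flow of the second stage can be sketched as follows. Everything here is hypothetical scaffolding: the `Rubric` class, the prompt layout, and the three callables (`rubric_generator`, `sample_student`, `score_tokens`) stand in for components the paper describes only at a high level. The point is simply that the rubric reaches the teacher's context but not the student's.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Rubric:
    """A set of task-specific criteria (hypothetical container)."""
    criteria: List[str]

    def render(self) -> str:
        return "\n".join(f"- {c}" for c in self.criteria)


def build_prompts(question: str, rubric: Rubric) -> Dict[str, str]:
    """Build the student prompt (no rubric) and teacher prompt (with rubric).

    The prompt format is an assumption made for illustration.
    """
    student_prompt = f"Question: {question}\nAnswer:"
    teacher_prompt = f"Rubric:\n{rubric.render()}\n\n{student_prompt}"
    return {"student": student_prompt, "teacher": teacher_prompt}


def rcsd_step(
    question: str,
    rubric_generator: Callable[[str], Rubric],
    sample_student: Callable[[str], List[str]],
    score_tokens: Callable[[str, List[str]], List[float]],
) -> Tuple[List[str], List[float]]:
    """One illustrative training step.

    1. The stage-one generator produces a rubric for the question.
    2. The student samples its own response without seeing the rubric.
    3. The rubric-conditioned teacher scores that response token by token.
    """
    rubric = rubric_generator(question)
    prompts = build_prompts(question, rubric)
    response = sample_student(prompts["student"])
    signals = score_tokens(prompts["teacher"], response)
    return response, signals


# Toy stand-ins so the sketch runs end to end.
def toy_generator(question: str) -> Rubric:
    return Rubric(["states the governing law", "defines each symbol"])


def toy_student(prompt: str) -> List[str]:
    return ["F", "=", "ma"]


def toy_teacher(prompt: str, response: List[str]) -> List[float]:
    # Emits 1.0 per token only if the rubric is present in the context.
    return [1.0 if "Rubric:" in prompt else 0.0 for _ in response]


response, signals = rcsd_step(
    "What is Newton's second law?", toy_generator, toy_student, toy_teacher
)
print(response, signals)
```

Keeping the rubric generator behind a plain callable mirrors the separation described above: the criteria can come from a trained generator, or from any other source, without changing the reasoner's training loop.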
Performance and Results
The researchers evaluated RCSD on a range of science reasoning benchmarks covering both verifiable tasks (such as math or code) and open-ended, non-verifiable tasks. RCSD outperforms existing methods, including Group Relative Policy Optimization (GRPO) and standard On-Policy Self-Distillation (OPSD), achieving a higher average score than either baseline. Its advantage is most pronounced on scientific reasoning tasks where simple outcome-based rewards often fail to capture the quality of the reasoning process.
Why This Matters
By using rubrics as a structured interface for supervision, the authors demonstrate that models can learn more effectively when they are taught the "what" and "why" of a good answer rather than just the "how." This approach helps address the credit-assignment problem, where it is often difficult to tell which specific part of a long reasoning chain led to an error. RCSD also remains on-policy: it learns from the model's own attempts rather than following a fixed, pre-written path. As a result, it encourages the model to explore multiple valid ways to solve a problem while still adhering to the necessary quality criteria.
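The credit-assignment point can be made concrete with a small contrast. In the sketch below, the helper names and the numbers are invented for illustration. A failed response under an outcome reward gives every token the same value, while a per-token signal singles out the step that went wrong.

```python
def broadcast_scalar_reward(reward, num_tokens):
    """Outcome-based supervision: every token inherits the same reward."""
    return [reward] * num_tokens


def most_penalized_position(per_token_signal):
    """Index of the token with the largest per-token penalty."""
    return max(range(len(per_token_signal)), key=per_token_signal.__getitem__)


# A five-step reasoning chain that ends in a wrong answer (made-up values).
scalar_view = broadcast_scalar_reward(0.0, 5)
dense_view = [0.01, 0.02, 0.90, 0.05, 0.03]

print(scalar_view)                          # [0.0, 0.0, 0.0, 0.0, 0.0]
print(most_penalized_position(dense_view))  # 2
```

The scalar view says only that the chain failed. The dense view points at the third step, which is the kind of localized signal that rubric-conditioned, token-level feedback is meant to provide.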