Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Key Takeaways

  • Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards.
  • Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved.
  • We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation.
  • Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories.

Paper Abstract

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
This paper introduces Rubric-Conditioned Self-Distillation (RCSD), a new framework designed to improve how reasoning language models are trained. Currently, models are often trained using reinforcement learning with simple "pass/fail" rewards or by imitating a single "gold standard" reasoning path. The authors argue that these methods are limited: scalar rewards are too vague to explain why a model failed, and imitating a single path can be overly restrictive. RCSD instead uses structured rubrics—sets of specific criteria—to provide dense, token-level guidance that helps the model learn the underlying principles of a high-quality response rather than just memorizing a single answer.

A New Way to Guide Learning

The core innovation of RCSD is moving away from compressing feedback into a single number or a single reference trajectory. Instead, the framework treats rubrics as a "privileged" source of information. During training, a teacher model is given a rubric that specifies what a strong response should satisfy. The teacher then provides token-level guidance on the reasoning trajectories that the student samples itself. The student therefore receives criterion-aware feedback that indicates where in its own response it meets or misses the criteria, rather than a single score for the whole response.
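
To make "token-level guidance" concrete, here is a minimal sketch of how a per-position signal could be computed from two sets of logits on the same sampled response. It is an illustration under stated assumptions, not the paper's implementation: the choice of KL(student || teacher) as the divergence, the function names, and the toy logits are all hypothetical, and the summary does not say which divergence RCSD actually uses.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position (assumed divergence, see lead-in)."""
    lp = log_softmax(student_logits)
    lq = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def per_token_guidance(student_seq, teacher_seq):
    """One divergence value per position of the student's sampled trajectory."""
    return [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]

# Toy logits over a 3-token vocabulary at two positions of one sampled response.
student_seq = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
# Hypothetical rubric-conditioned teacher: agrees at position 0,
# prefers a different token at position 1.
teacher_seq = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

signal = per_token_guidance(student_seq, teacher_seq)
print(signal)
```

In this toy case the signal is zero where the two distributions agree and positive where they differ, so the training signal is localized to a position. A scalar reward, by contrast, would assign one value to the whole response.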

The Two-Stage Pipeline

To make this approach practical, the authors divide the training process into two distinct stages. In the first stage, they train a "rubric generator" that learns to create task-specific evaluation criteria for any given question. This allows the model to generate its own rubrics without needing human-written ones at test time. In the second stage, the reasoner uses these generated rubrics to guide its own learning. By separating the creation of criteria from the reasoning process, the model becomes more flexible and capable of handling complex tasks where there isn't just one "correct" way to think through a problem.
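
The data flow between the two stages can be sketched as follows. This is a hedged illustration only: the prompt templates, the `TrainingExample` container, and the stand-in `toy_rubric_generator` are hypothetical names introduced here, and the sketch assumes that the student sees only the question while the teacher additionally sees the rubric, in line with the rubric being "privileged" information.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TrainingExample:
    question: str
    rubric: List[str]  # criterion-level rubric produced in stage 1

def make_example(question: str,
                 rubric_generator: Callable[[str], List[str]]) -> TrainingExample:
    """Stage 1 (sketch): attach generated criteria to a question."""
    return TrainingExample(question=question, rubric=rubric_generator(question))

def student_prompt(example: TrainingExample) -> str:
    """Stage 2 (sketch): the student samples its response from the question alone."""
    return f"Question: {example.question}\nAnswer:"

def teacher_prompt(example: TrainingExample) -> str:
    """Stage 2 (sketch): the teacher is also conditioned on the rubric."""
    criteria = "\n".join(f"- {c}" for c in example.rubric)
    return f"Question: {example.question}\nRubric:\n{criteria}\nAnswer:"

def toy_rubric_generator(question: str) -> List[str]:
    """Stand-in for a trained rubric generator (hypothetical fixed output)."""
    return ["States the governing principle", "Checks units in the final result"]

example = make_example("Why does ice float on water?", toy_rubric_generator)
print(student_prompt(example))
print(teacher_prompt(example))
```

Keeping rubric generation behind a plain callable mirrors the separation described above: the component that produces criteria can be swapped out without touching the code that builds the teacher and student inputs.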

Performance and Results

The researchers tested RCSD across a variety of science reasoning benchmarks, including both verifiable tasks (like math or code) and open-ended, non-verifiable tasks. The results show that RCSD outperforms existing methods, including Group Relative Policy Optimization (GRPO) and standard On-Policy Self-Distillation (OPSD). On average, RCSD surpasses GRPO by 1.0 points and OPSD by 0.9 points, with particularly strong performance in scientific reasoning tasks where simple outcome-based rewards often fail to capture the quality of the reasoning process.

Why This Matters

By using rubrics as a structured interface for supervision, the authors demonstrate that models can learn more effectively when they are taught what a good answer must satisfy, rather than being shown a single way to produce one. This approach helps address the credit-assignment problem in machine learning, where it is often difficult to tell which specific part of a long reasoning chain led to an error. Because RCSD remains on-policy (it learns from the model's own attempts rather than following a fixed, pre-written path), it encourages the model to explore multiple valid ways to solve a problem while still adhering to the necessary quality criteria.
