Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Key Takeaways

  • With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation.
  • The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond.
  • These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
Paper Abstract

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
This paper introduces "rubric-grounded reinforcement learning," a framework for training large language models (LLMs) to reason more effectively. Instead of relying on a single holistic score to judge whether an answer is good, the researchers break the evaluation into a checklist of specific, weighted criteria. A frozen LLM judge grades each response against this rubric, so the model receives "partial credit" for the criteria it satisfies. The authors report that this finer-grained feedback improves performance on held-out rubric evaluation and on four reasoning benchmarks that were not derived from the training corpus.

How the Approach Works

The core of this method is a "privileged" training setup. During training, the policy model answers a question without seeing the source document. A separate, frozen "judge" model is given the question, the response, and the original source document, and scores the response against a structured rubric: a list of weighted requirements such as technical accuracy, use of specific terminology, and logical flow. Because the policy is optimized to satisfy these criteria without access to the source text, it cannot simply copy information from the document; the intended effect is that it internalizes high-quality reasoning patterns instead.
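The information asymmetry described above can be sketched as two prompt builders. This is a minimal illustration, not the paper's implementation: the `Example` type, the function names, and the prompt wording are all hypothetical, chosen only to show that the source document reaches the judge but never the policy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Example:
    """One training item (hypothetical structure, for illustration only)."""
    question: str
    source_document: str      # privileged grounding: shown to the judge only
    rubric: Tuple[str, ...]   # criteria derived offline from the document


def build_policy_prompt(ex: Example) -> str:
    # The policy is conditioned on the question alone.
    return f"Question: {ex.question}\nAnswer:"


def build_judge_prompt(ex: Example, response: str) -> str:
    # The frozen judge additionally sees the source document and the rubric.
    criteria = "\n".join(
        f"{i}. {text}" for i, text in enumerate(ex.rubric, start=1)
    )
    return (
        f"Source document:\n{ex.source_document}\n\n"
        f"Question: {ex.question}\n\n"
        f"Response to grade:\n{response}\n\n"
        f"Rubric criteria:\n{criteria}\n\n"
        "For each criterion, state whether the response satisfies it."
    )
```

Keeping the two builders separate makes the constraint easy to audit: nothing in `build_policy_prompt` reads `source_document`, so the policy cannot be rewarded for copying it.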

The Power of Partial Credit

Traditional reinforcement learning often compresses the quality of an answer into a single scalar, either a binary outcome or one holistic score, which hides which parts of the answer were right. By decomposing the reward into multiple weighted criteria, the researchers create a more informative signal for the model. This provides a "resolution" advantage: the model can be rewarded for getting parts of a complex problem right even if the final answer is not perfect. The structured feedback distinguishes between "bad," "partially correct," and "nearly complete" answers, which is intended to make learning more stable and effective.
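To make the partial-credit idea concrete, the sketch below aggregates per-criterion judge scores into one normalized reward. The summary does not state the paper's exact aggregation rule, so the weighted average used here is an assumption, as are the `Criterion` type, the function name, and the example weights.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """A single rubric item (hypothetical structure)."""
    description: str
    weight: float


def normalized_reward(
    criteria: Sequence[Criterion], scores: Sequence[float]
) -> float:
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1]."""
    if len(criteria) != len(scores):
        raise ValueError("need exactly one score per criterion")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = sum(c.weight * s for c, s in zip(criteria, scores))
    return weighted / total_weight


# Illustrative rubric; the weights are made up for this example.
rubric = [
    Criterion("technically accurate", 2.0),
    Criterion("uses specific terminology", 1.0),
    Criterion("logical flow", 1.0),
]
partially_correct = normalized_reward(rubric, [1.0, 0.0, 0.0])  # 0.5
nearly_complete = normalized_reward(rubric, [1.0, 1.0, 0.0])    # 0.75
```

A binary reward would assign both of these responses the same failing score; the weighted rubric separates them, which is the "resolution" advantage the section describes.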

Key Results and Generalization

The researchers tested this framework by training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO), using rubrics derived from roughly 100,000 scientific and technical documents from the Office of Scientific and Technical Information (OSTI). The trained model achieved a 71.7% normalized reward on held-out rubric evaluation. It also improved over the base model on four external reasoning benchmarks (GSM8K, MATH, GPQA Main, and GPQA Diamond) that were not derived from the training corpus. These results suggest that training on structured, document-grounded rubrics can induce reasoning behaviors that transfer beyond the original training data.
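The abstract names GRPO as the training algorithm. As a rough sketch of why graded rewards matter there, the snippet below computes group-relative advantages in the commonly described GRPO form: each sampled response's reward is standardized against the other responses to the same prompt. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration; the paper's exact variant, including any clipping or KL terms, is not given in this summary.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Standardize each reward against its own sampling group."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt, none fully correct.
binary_rewards = [0.0, 0.0, 0.0, 0.0]    # pass/fail: every response fails
rubric_rewards = [0.0, 0.25, 0.5, 0.75]  # partial credit from a rubric

flat = group_relative_advantages(binary_rewards)    # all zeros: no signal
graded = group_relative_advantages(rubric_rewards)  # ranks the responses
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient. Partial-credit rewards make such ties less likely, so more groups carry a usable learning signal.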

Important Considerations

The framework depends on the quality of the rubrics and the reliability of the judge model. The researchers note that although the judge is called repeatedly during training, the rubrics themselves are created offline, which helps manage computational cost. The framework is designed to be domain-agnostic: it could in principle be applied to other areas such as legal drafting, clinical summarization, or code review, provided that the quality of the work can be broken down into a verifiable checklist.
