JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Key Takeaways

  • JURY-RL is a label-free RLVR framework that decouples answer proposal from reward disposal: model rollouts vote to propose a candidate answer, and a formal verifier decides whether that candidate can receive positive reward.
  • Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications.
  • In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training.
  • Only rollouts matching the plurality-voted answer are rewarded, and only when that answer is successfully verified in Lean.
  • When verification is inconclusive, JURY-RL invokes ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers.
Paper Abstract

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
Reinforcement learning with verifiable rewards (RLVR) is a powerful way to improve the reasoning capabilities of large language models by training them to produce answers that can be checked for correctness. However, this process usually requires expensive human-annotated labels or carefully crafted reward rules. JURY-RL is a new framework that enables label-free training by separating the process of suggesting an answer from the process of verifying it. By combining majority voting with formal mathematical verification, the system ensures that models are rewarded for provable truth rather than just popular consensus or potentially biased judgments.

The "Votes Propose, Proofs Dispose" Workflow

The core of JURY-RL is a two-stage process that balances scalability with accuracy. First, the model generates multiple potential answers for a given problem, and a "jury", a simple plurality vote over the rollouts, selects the most common answer as the candidate. Second, a formal theorem prover (specifically Lean) acts as a judge and attempts to verify this candidate. If the verifier confirms the answer is correct, every rollout that matches it receives a positive reward. This ensures that the training signal is anchored in objective, verifiable truth, which prevents the model from being rewarded for "hallucinated" but popular answers.
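
This reward logic can be sketched in a few lines. The sketch below is an illustration, not the paper's implementation: the names jury_reward, verify_in_lean, and fallback are hypothetical, and the binary 1/0 reward for rollouts matching a verified candidate is an assumption consistent with the description above.

```python
from collections import Counter
from typing import Callable, List, Optional

def jury_reward(
    answers: List[str],
    verify_in_lean: Callable[[str], Optional[bool]],
    fallback: Callable[[List[str], str], List[float]],
) -> List[float]:
    """Votes propose, proofs dispose: assign one reward per rollout.

    answers:        final answers extracted from the sampled rollouts.
    verify_in_lean: returns True (proved), False (refuted), or
                    None (inconclusive) for a candidate answer.
    fallback:       reward used when the candidate is not proved;
                    JURY-RL uses ResZero here (sketched below).
    """
    # Votes propose: the plurality answer becomes the candidate.
    candidate, _ = Counter(answers).most_common(1)[0]

    # Proofs dispose: positive reward only if Lean verifies the candidate.
    if verify_in_lean(candidate) is True:
        return [1.0 if a == candidate else 0.0 for a in answers]

    # Unproved or inconclusive: never reinforce the unverified consensus.
    return fallback(answers, candidate)
```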

Handling Inconclusive Results with ResZero

A major challenge in label-free training is what to do when the verifier cannot reach a definitive conclusion. Simply rewarding the majority vote in these cases can lead to "reward hacking," where the model learns to prioritize consensus over correctness. To solve this, the researchers introduced ResZero (Residual-Zero). When verification is inconclusive, the system discards the majority proposal and instead assigns a zero-mean, variance-preserving reward to the remaining "residual" answers. This approach keeps the training process stable and prevents the model from collapsing into a state where it only reinforces unverified, potentially incorrect patterns.
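
The summary does not pin down the exact residual signal, so the sketch below makes one assumption: each residual rollout is scored by how many residual rollouts agree with its answer, and those scores are standardized to zero mean and unit variance (one plausible reading of "variance-preserving"). Rollouts matching the discarded candidate receive zero, and the function name reszero_reward is hypothetical.

```python
import statistics
from typing import List

def reszero_reward(answers: List[str], candidate: str) -> List[float]:
    """ResZero fallback (sketch): zero out the discarded candidate and
    give residual rollouts a zero-mean, unit-variance agreement score."""
    residual = [a for a in answers if a != candidate]
    if not residual:
        return [0.0] * len(answers)

    # Assumed residual score: how many residual rollouts share the answer.
    counts = {a: residual.count(a) for a in set(residual)}
    raw = [float(counts[a]) for a in residual]

    mu, sigma = statistics.fmean(raw), statistics.pstdev(raw)
    if sigma == 0.0:  # all residual scores equal: no useful spread
        return [0.0] * len(answers)

    # Standardize, then re-insert zeros for candidate-matching rollouts.
    normalized = iter((r - mu) / sigma for r in raw)
    return [0.0 if a == candidate else next(normalized) for a in answers]
```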

Performance and Generalization

JURY-RL demonstrates that it is possible to achieve high-quality reasoning without human labels. Across various mathematical and coding benchmarks, the framework consistently outperforms other label-free methods. Notably, it achieves pass@1 performance—the likelihood that the first generated answer is correct—that is comparable to models trained with supervised ground-truth data. Furthermore, the framework shows superior generalization, often outperforming supervised baselines in metrics like pass@k and the diversity of the solutions produced, suggesting that the model learns more robust reasoning strategies rather than just memorizing specific answer formats.
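
For reference, pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021): sample n generations, count the c correct ones, and estimate the probability that at least one of k drawn samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if n - c < k:  # fewer incorrect samples than k: success is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples, 4 correct: pass@1 = 0.25, pass@4 ≈ 0.73
print(pass_at_k(16, 4, 1), pass_at_k(16, 4, 4))
```

Because pass@k rises with the number of distinct correct solutions the model can produce, higher pass@k alongside comparable pass@1 is evidence of greater response diversity rather than memorization of a single answer format.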

Key Advantages for Training Stability

By decoupling the proposal of an answer from the final reward, JURY-RL addresses the common pitfalls of existing self-rewarding methods. Traditional approaches often suffer from "entropy collapse" or training instability when they rely solely on model-internal confidence or simple agreement. JURY-RL’s design—specifically the use of the ResZero fallback—ensures that the model continues to learn even when formal verification is not possible. This makes the framework a more reliable and scalable alternative for training models in complex, machine-checkable domains like mathematics and programming.
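
As a toy demonstration (reusing the hypothetical jury_reward and reszero_reward sketches above), even a verifier stub that proves nothing still produces a usable, zero-mean training signal rather than a stalled update:

```python
# A stub verifier that proves nothing, so the batch routes through
# the ResZero fallback instead of reinforcing the consensus answer.
answers = ["42", "42", "42", "41", "43", "41"]

def stub_verifier(candidate: str):
    return None  # inconclusive; stands in for a real Lean check

rewards = jury_reward(answers, stub_verifier, fallback=reszero_reward)
print(rewards)
# Candidate "42" rollouts get 0.0; "41"/"43" rollouts get a zero-mean,
# unit-variance signal, so the policy-gradient update stays non-trivial.
```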
