JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
Reinforcement learning with verifiable rewards (RLVR) is a powerful way to improve the reasoning capabilities of large language models by training them to produce answers that can be checked for correctness. However, this process usually requires expensive human-annotated labels or carefully crafted reward rules. JURY-RL is a new framework that enables label-free training by separating the process of suggesting an answer from the process of verifying it. By combining majority voting with formal mathematical verification, the system ensures that models are rewarded for provable truth rather than just popular consensus or potentially biased judgments.
The "Votes Propose, Proofs Dispose" Workflow
The core of JURY-RL is a two-stage process that balances scalability with accuracy. First, the model generates multiple candidate answers for a given problem, and a "jury" (a simple majority vote over the samples) proposes the most common answer as the candidate. Second, a formal theorem prover (specifically Lean) acts as the judge and verifies this candidate. If the prover confirms the answer is correct, the model receives a positive reward. This anchors the training signal in objective, verifiable truth and prevents the model from being rewarded for answers that are popular but hallucinated.
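The two-stage loop can be sketched in Python. Here `verify_with_lean` is a hypothetical stand-in for a real Lean verification call, and the exact reward values are illustrative assumptions rather than the paper's specification:

```python
from collections import Counter

def majority_candidate(answers):
    """Stage 1 (jury): a majority vote over sampled answers proposes a candidate."""
    candidate, _ = Counter(answers).most_common(1)[0]
    return candidate

def verify_with_lean(problem, answer):
    """Stage 2 (judge): placeholder for formal verification in Lean.
    Returns True (proved), False (disproved), or None (inconclusive)."""
    ...  # in practice: compile the problem and answer into a Lean goal and run the prover

def jury_reward(problem, answers):
    """Reward sampled answers only when the majority candidate is formally verified."""
    candidate = majority_candidate(answers)
    verdict = verify_with_lean(problem, candidate)
    if verdict is True:
        # Illustrative choice: +1 to answers matching the proved candidate.
        return [1.0 if a == candidate else 0.0 for a in answers]
    if verdict is False:
        return [0.0] * len(answers)
    return None  # inconclusive: handled by the ResZero fallback
```

Note that the reward is attached to the *verified* candidate, not to the vote itself; the vote only decides which answer is worth the cost of formal verification.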
Handling Inconclusive Results with ResZero
A major challenge in label-free training is what to do when the verifier cannot reach a definitive conclusion. Simply rewarding the majority vote in these cases can lead to "reward hacking," where the model learns to prioritize consensus over correctness. To solve this, the researchers introduced ResZero (Residual-Zero). When verification is inconclusive, the system discards the majority proposal and instead assigns a zero-mean, variance-preserving reward to the remaining "residual" answers. This approach keeps the training process stable and prevents the model from collapsing into a state where it only reinforces unverified, potentially incorrect patterns.
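A minimal sketch of the fallback, assuming (purely as an illustration) that each residual answer's raw score is its frequency among the residuals, which is then standardized to zero mean; the paper's exact scoring rule may differ:

```python
from collections import Counter

def reszero_rewards(answers, majority_answer):
    """ResZero fallback sketch: on an inconclusive verification, discard the
    majority proposal and give the residual answers a zero-mean reward."""
    residual_idx = [i for i, a in enumerate(answers) if a != majority_answer]
    rewards = [0.0] * len(answers)  # the unverified majority proposal gets no reward
    if len(residual_idx) < 2:
        return rewards
    # Illustrative raw score: each residual answer's frequency among the residuals.
    counts = Counter(answers[i] for i in residual_idx)
    raw = [counts[answers[i]] for i in residual_idx]
    mean = sum(raw) / len(raw)
    var = sum((r - mean) ** 2 for r in raw) / len(raw)
    std = var ** 0.5
    for i, r in zip(residual_idx, raw):
        # Zero-mean, unit-variance reward: a stand-in for the paper's
        # variance-preserving scheme, keeping the gradient signal alive
        # without endorsing the unverified majority.
        rewards[i] = (r - mean) / std if std > 0 else 0.0
    return rewards
```

The key property is that the rewards sum to zero: the model still receives a learning signal from the relative structure of the residual answers, but no answer is treated as ground truth.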
Performance and Generalization
JURY-RL demonstrates that high-quality reasoning can be learned without human labels. Across mathematical and coding benchmarks, the framework consistently outperforms other label-free methods. Notably, it achieves pass@1 performance (the probability that the first sampled answer is correct) comparable to models trained on supervised ground-truth data. The framework also generalizes better, often beating supervised baselines on pass@k and on the diversity of the solutions produced, which suggests the model learns more robust reasoning strategies rather than memorizing specific answer formats.
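For context, pass@k is commonly computed with the standard unbiased estimator used in code-generation evaluation (an assumption here, since the article does not specify the estimator): from n samples of which c are correct, it estimates the probability that at least one of k drawn samples is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n samples containing c correct ones, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 is 0.5, while pass@2 is 1.0 since any draw of both samples includes the correct one.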
Key Advantages for Training Stability
By decoupling the proposal of an answer from the final reward, JURY-RL avoids common pitfalls of existing self-rewarding methods. Approaches that rely solely on model-internal confidence or simple agreement often suffer from entropy collapse or training instability. JURY-RL's design, in particular the ResZero fallback, keeps the model learning even when formal verification is not possible, making the framework a more reliable and scalable option for training models in complex, machine-checkable domains such as mathematics and programming.