Key Takeaways

  • While reinforcement learning from verifiable rewards (RLVR) typically has relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback.
  • This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound.
  • In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training.
  • Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory.
Paper Abstract

While reinforcement learning from verifiable rewards (RLVR) typically has relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balances outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.

Process-Verified Reinforcement Learning for Theorem Proving via Lean
This research introduces a new way to train AI models for mathematical theorem proving by using the Lean proof assistant as a "process oracle." While traditional reinforcement learning for AI reasoning often relies on a simple "pass/fail" signal at the very end of a task, this approach uses the fine-grained, step-by-step feedback provided by Lean to guide the model during training. By treating the proof assistant as an active supervisor rather than just a final judge, the researchers enable the model to learn from its specific mistakes and successes at every stage of the reasoning process.

Turning Lean into a Teacher

In formal theorem proving, a proof is a sequence of individual steps called "tactics." The Lean proof assistant can verify whether each of these tactics is logically sound according to type theory. The researchers parse each proof attempt into its tactic sequence and use Lean's elaboration to extract structured feedback: when an attempt fails, Lean identifies the earliest tactic that fails. This allows the training framework to distinguish between a "locally sound" step, which Lean accepts even if the proof is not yet complete, and an "erroneous" step that invalidates the proof attempt.
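As a small illustration of the distinction above (a hand-written example, not taken from the paper), consider the Lean 4 proof attempt below. The first two tactics are accepted by Lean, while the third is the earliest failing step because the hypothesis it supplies has the wrong type.

```lean
-- Illustrative attempt: prove `p ∧ q` from `hp : p` and `hq : q`.
example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
  constructor   -- locally sound: splits the goal into `p` and `q`
  · exact hp    -- locally sound: closes the first goal `p`
  · exact hp    -- earliest failing step: `hp : p` does not prove `q`
```

Replacing the last tactic with `exact hq` makes the whole proof check. An outcome-only reward would see a single failure here, whereas tactic-level feedback can still credit the first two steps.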

Structured Credit Assignment

A major challenge in training language models is "credit assignment": determining which specific tokens in a long sequence are responsible for a success or failure. To address this, the authors introduce a "first-error propagation" rule. If a model makes a mistake at a specific step, the system marks that step and all subsequent steps as invalid, while the steps before it remain locally sound. This ties the penalty to the exact point where the reasoning broke down. These tactic-level signals are then converted into numerical scores and integrated into a GRPO-style reinforcement learning objective that balances outcome-level and process-level advantages, so the model receives dense, verifier-grounded feedback throughout the generation process.
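The rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ProofAttempt` type, the +1/-1 tactic scores, the additive blend, and the `process_weight` parameter are all hypothetical choices made for the example. The outcome term uses the usual GRPO-style normalization of rewards within a sampled group. How per-tactic values are mapped onto tokens (the abstract mentions a first-token credit method) is not described in this summary and is left out.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class ProofAttempt:
    """Hypothetical container for one sampled proof attempt."""
    tactics: List[str]            # tactic strings, in order
    first_error: Optional[int]    # index of the earliest failing tactic,
                                  # or None if Lean accepted the whole proof


def propagate_first_error(n_tactics: int, first_error: Optional[int]) -> List[float]:
    """First-error propagation: tactics before the first failure are scored
    +1 (locally sound); the failing tactic and everything after it are
    scored -1. The +1/-1 values are an assumption for illustration."""
    if first_error is None:
        return [1.0] * n_tactics
    return [1.0 if i < first_error else -1.0 for i in range(n_tactics)]


def group_outcome_advantages(outcomes: List[float]) -> List[float]:
    """GRPO-style outcome advantage: normalize each reward by the mean and
    standard deviation of its sampling group."""
    mu = mean(outcomes)
    sigma = pstdev(outcomes)
    if sigma == 0:
        # All attempts in the group got the same reward: no outcome signal.
        return [0.0] * len(outcomes)
    return [(r - mu) / sigma for r in outcomes]


def tactic_advantages(attempt: ProofAttempt, outcome_adv: float,
                      process_weight: float = 0.5) -> List[float]:
    """Blend the sequence-level outcome advantage with per-tactic process
    scores. The additive form and the weight are assumptions."""
    scores = propagate_first_error(len(attempt.tactics), attempt.first_error)
    return [outcome_adv + process_weight * s for s in scores]


# Usage: a group of two attempts at the same theorem.
group = [
    ProofAttempt(["constructor", "exact hp", "exact hq"], first_error=None),
    ProofAttempt(["constructor", "exact hp", "exact hp"], first_error=2),
]
outcomes = [1.0 if a.first_error is None else 0.0 for a in group]
for attempt, adv in zip(group, group_outcome_advantages(outcomes)):
    print(tactic_advantages(attempt, adv))
```

In this sketch the failed attempt still receives a less negative value on the two tactics that Lean accepted than on the failing one, which is the kind of dense signal an outcome-only reward cannot provide.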

Empirical Performance

The researchers tested this approach by applying it to two existing models, STP-Lean and DeepSeek-Prover-V1.5. With tactic-level supervision, the models outperformed outcome-only baselines in most settings, with improvements on established benchmarks such as MiniF2F and ProofNet. The results indicate that fine-grained, verifier-grounded feedback is generally more effective than outcome-only rewards. This method achieves these gains without needing human-labeled data or external, natural-language-based reward models, demonstrating that symbolic proof assistants can be powerful tools for improving AI reasoning reliability.

A New Perspective on Formal Reasoning

Beyond the specific performance gains, this study highlights a shift in how we view symbolic tools in AI development. Rather than using proof assistants only to check the final answer at the end of a search, this framework treats them as integral components of the learning loop. By bridging the gap between the symbolic, rule-based nature of Lean and the probabilistic, autoregressive nature of language models, this work provides a scalable path toward building AI systems that combine the flexibility of large language models with the rigorous reliability of formal logic.
