Process-Verified Reinforcement Learning for Theorem Proving via Lean
This research introduces a new way to train AI models for mathematical theorem proving by using the Lean proof assistant as a "process oracle." While traditional reinforcement learning for AI reasoning often relies on a simple "pass/fail" signal at the very end of a task, this approach uses the fine-grained, step-by-step feedback provided by Lean to guide the model during training. By treating the proof assistant as an active supervisor rather than just a final judge, the researchers enable the model to learn from its specific mistakes and successes at every stage of the reasoning process.
Turning Lean into a Teacher
In formal theorem proving, a proof is a sequence of individual steps called "tactics." The Lean proof assistant checks whether each tactic is logically sound under its type theory. The researchers developed a method to parse these proofs and extract structured feedback from Lean’s internal logs. When a model-generated proof fails to check, Lean reports exactly which tactic failed first. This allows the training framework to distinguish between a "locally sound" step (correct, but perhaps not enough to finish the proof) and an "erroneous" step that invalidates the entire proof.
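As a rough illustration of locating the first failed tactic, the Python sketch below maps error positions back to tactic spans. The `Tactic` record, the `first_failed_tactic` helper, and the idea that errors arrive as 1-indexed line numbers are all assumptions made for this example; the summary does not describe the authors' actual parser or the format of Lean's logs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tactic:
    """Hypothetical record for one tactic and the proof lines it spans."""
    text: str
    start_line: int  # 1-indexed first line of the tactic
    end_line: int    # 1-indexed last line of the tactic, inclusive


def first_failed_tactic(tactics: List[Tactic], error_lines: List[int]) -> Optional[int]:
    """Return the index of the earliest tactic whose span contains an error line.

    Returns None when no reported error falls inside any tactic span.
    Assumes `tactics` is listed in proof order.
    """
    for index, tactic in enumerate(tactics):
        if any(tactic.start_line <= line <= tactic.end_line for line in error_lines):
            return index
    return None


# Illustrative proof with three single-line tactics and an error on line 4.
proof = [
    Tactic("intro h", 2, 2),
    Tactic("constructor", 3, 3),
    Tactic("exact h", 4, 4),
]
failed_index = first_failed_tactic(proof, error_lines=[4])
```

Here `failed_index` is 2, so the first two tactics would count as locally sound and the third as erroneous. A proof with no reported errors yields `None`.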
Structured Credit Assignment
A major challenge in training language models is "credit assignment": determining which specific tokens in a long sequence are responsible for a success or failure. To address this, the authors introduced a "first-error propagation" rule. If a model makes a mistake at a specific step, the system marks that step and all subsequent steps as invalid, while the steps before it remain locally sound. The penalty therefore begins at the exact point where the reasoning broke down rather than being spread over the whole proof. These tactic-level signals are then converted into numerical scores and integrated into a reinforcement learning objective, so the model receives dense, meaningful feedback throughout the entire generation process.
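The first-error propagation rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the +1.0 and -1.0 score values, and the choice to copy each tactic's score onto its tokens are all hypothetical, since the summary does not give the actual scoring scheme or objective.

```python
from typing import List, Optional


def propagate_first_error(num_tactics: int, first_error: Optional[int]) -> List[bool]:
    """Label each tactic: True = locally sound, False = invalid.

    The tactic at `first_error` and every later tactic are marked invalid.
    `first_error=None` means the whole proof checked.
    """
    if first_error is None:
        return [True] * num_tactics
    return [index < first_error for index in range(num_tactics)]


def tactic_scores(valid: List[bool],
                  sound_score: float = 1.0,       # assumed value
                  invalid_score: float = -1.0     # assumed value
                  ) -> List[float]:
    """Convert validity labels into numerical scores."""
    return [sound_score if is_valid else invalid_score for is_valid in valid]


def token_rewards(scores: List[float], tokens_per_tactic: List[int]) -> List[float]:
    """Copy each tactic's score onto every token of that tactic (an assumption)."""
    rewards: List[float] = []
    for score, count in zip(scores, tokens_per_tactic):
        rewards.extend([score] * count)
    return rewards


# Four tactics, with the first error at index 2.
labels = propagate_first_error(4, first_error=2)
scores = tactic_scores(labels)
rewards = token_rewards(scores, tokens_per_tactic=[3, 2, 4, 1])
```

With these inputs, `labels` is `[True, True, False, False]`, and `rewards` holds five positive entries followed by five negative ones, one per generated token. How such per-token values enter the actual training objective is left open here.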
Empirical Performance
The researchers tested this approach by applying it to existing models like STP-Lean and DeepSeek-Prover-V1.5. By incorporating tactic-level supervision, the models showed consistent performance improvements on established benchmarks such as MiniF2F and ProofNet. The results indicate that providing the model with fine-grained, verifier-grounded feedback is more effective than relying on outcome-only rewards. This method achieves these gains without needing human-labeled data or external, natural-language-based reward models, demonstrating that symbolic proof assistants can be powerful tools for improving AI reasoning reliability.
A New Perspective on Formal Reasoning
Beyond the specific performance gains, this study highlights a shift in how we view symbolic tools in AI development. Rather than using proof assistants only to check the final answer at the end of a search, this framework treats them as integral components of the learning loop. By bridging the gap between the symbolic, rule-based nature of Lean and the probabilistic, autoregressive nature of language models, this work provides a scalable path toward building AI systems that combine the flexibility of large language models with the rigorous reliability of formal logic.