Back to AI Research

AI Research

Agentic Proving for Program Verification | AI Research

Key Takeaways

  • Agentic Proving for Program Verification This paper investigates how modern "agentic" AI systems—which use planning, tool use, and iterative refinement—perfo...
  • Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics.
  • To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation.
  • Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset.
  • More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.
Paper AbstractExpand

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

Agentic Proving for Program Verification
This paper investigates how modern "agentic" AI systems—which use planning, tool use, and iterative refinement—perform at the complex task of program verification. While AI has made significant strides in solving mathematical proofs, program verification is more difficult because it requires reasoning about executable code, specific requirements, and strict logical constraints. The authors evaluate Claude Code on the CLEVER benchmark, a collection of problems designed to test whether an AI can generate correct code and prove that it meets its intended specifications within the Lean 4 theorem prover.

How the Agentic Approach Works

The researchers utilized an agentic framework powered by Claude Opus 4.6, equipped with specialized tools to interface with the Lean 4 environment. The system operates in a "compiler-in-the-loop" fashion: it initializes a project, attempts to write code or specifications, and uses the compiler's feedback to refine its work. The agents were given tasks ranging from generating formal specifications based on natural language descriptions to proving that a piece of code is mathematically equivalent to a given requirement. Crucially, the agents were also tasked with self-diagnosing their own failures, allowing them to identify whether a problem was caused by their own code or by errors within the benchmark dataset itself.

Key Findings and Performance

The results demonstrate that agentic systems are currently highly effective at foundational program verification, significantly outperforming previous benchmarks. Claude successfully generated valid specifications for 98.8% of the problems. When tasked with creating implementations that satisfy these specifications, the system achieved an 87.5% success rate using the benchmark's ground-truth requirements. Most impressively, when looking at the entire end-to-end pipeline—generating a specification, writing the code, and proving the code is correct—the system reached a 98.1% success rate on problems with consistent premises. Beyond these metrics, the model provided high-quality feedback, often correctly identifying bugs in the benchmark’s own ground-truth files.

Challenges in Evaluation

The study highlights a growing gap between the capabilities of modern AI and the design of existing verification benchmarks. A major hurdle identified is the reliance on "isomorphism-based scoring," which checks if an AI's specification is identical to a human-written one. The authors argue this is flawed because natural language is often ambiguous; there can be multiple valid ways to interpret a requirement. Because the AI often produced "arguably valid" interpretations that differed from the benchmark's single ground-truth, the researchers suggest that future evaluations need to be more flexible and resilient to these variations.

Implications for the Future

The findings suggest that the current generation of agentic provers has reached a level of maturity where they can reliably handle complex verification tasks. However, the authors note that as these tools improve, the benchmarks used to test them must also evolve. The fact that the AI was able to identify errors in the benchmark itself indicates that we are moving toward a point where AI can assist in the creation and debugging of formal verification datasets, rather than just being a student of them. The authors conclude that tight, compiler-integrated agentic paradigms are currently the most effective strategy for advancing the field of program verification.

Comments (0)

No comments yet

Be the first to share your thoughts!