Agentic Proving for Program Verification
This paper investigates how modern "agentic" AI systems—which use planning, tool use, and iterative refinement—perform at the complex task of program verification. While AI has made significant strides in solving mathematical proofs, program verification is more difficult because it requires reasoning about executable code, specific requirements, and strict logical constraints. The authors evaluate Claude Code on the CLEVER benchmark, a collection of problems designed to test whether an AI can generate correct code and prove that it meets its intended specifications within the Lean 4 theorem prover.
How the Agentic Approach Works
The researchers utilized an agentic framework powered by Claude Opus 4.6, equipped with specialized tools to interface with the Lean 4 environment. The system operates in a "compiler-in-the-loop" fashion: it initializes a project, attempts to write code or specifications, and uses the compiler's feedback to refine its work. The agents were given tasks ranging from generating formal specifications based on natural language descriptions to proving that a piece of code is mathematically equivalent to a given requirement. Crucially, the agents were also tasked with self-diagnosing their own failures, allowing them to identify whether a problem was caused by their own code or by errors within the benchmark dataset itself.
Key Findings and Performance
The results demonstrate that agentic systems are currently highly effective at foundational program verification, significantly outperforming previous benchmarks. Claude successfully generated valid specifications for 98.8% of the problems. When tasked with creating implementations that satisfy these specifications, the system achieved an 87.5% success rate using the benchmark's ground-truth requirements. Most impressively, when looking at the entire end-to-end pipeline—generating a specification, writing the code, and proving the code is correct—the system reached a 98.1% success rate on problems with consistent premises. Beyond these metrics, the model provided high-quality feedback, often correctly identifying bugs in the benchmark’s own ground-truth files.
Challenges in Evaluation
The study highlights a growing gap between the capabilities of modern AI and the design of existing verification benchmarks. A major hurdle identified is the reliance on "isomorphism-based scoring," which checks if an AI's specification is identical to a human-written one. The authors argue this is flawed because natural language is often ambiguous; there can be multiple valid ways to interpret a requirement. Because the AI often produced "arguably valid" interpretations that differed from the benchmark's single ground-truth, the researchers suggest that future evaluations need to be more flexible and resilient to these variations.
Implications for the Future
The findings suggest that the current generation of agentic provers has reached a level of maturity where they can reliably handle complex verification tasks. However, the authors note that as these tools improve, the benchmarks used to test them must also evolve. The fact that the AI was able to identify errors in the benchmark itself indicates that we are moving toward a point where AI can assist in the creation and debugging of formal verification datasets, rather than just being a student of them. The authors conclude that tight, compiler-integrated agentic paradigms are currently the most effective strategy for advancing the field of program verification.
Comments (0)
to join the discussion
No comments yet
Be the first to share your thoughts!