AxDafny: Agentic Verified Code Generation in Dafny
This paper introduces AxDafny, an agentic framework designed to automate the generation of both executable code and the formal proof artifacts required to verify that code. While traditional AI coding assistants often rely on testing against a set of examples, AxDafny uses the Dafny programming language and its built-in verifier to ensure code is logically correct according to specific requirements. By iteratively generating implementations, invariants, and assertions, the system aims to bridge the gap between AI-generated code and formally verified software.
How AxDafny Works
AxDafny operates through an iterative "generate-check-repair" loop. A proposer agent generates a candidate solution, which is then evaluated by a reviewer system. This reviewer performs two types of checks: deterministic filters that ensure the code hasn't been "cheated" (such as by using shortcuts to bypass verification) and a verifier-based check that uses the Dafny compiler to identify specific logical errors or missing proof steps.
When verification fails, the system provides the model with diagnostic feedback, including the location of the error and the specific proof obligation that was not met. The agent then uses this information to refine its code. A memory module tracks the history of these attempts, allowing the model to learn from previous mistakes and maintain a "scratchpad" of lessons learned throughout the process.
Evaluating Verified Code
To test the framework, the authors introduced LCB-Pro-Dafny, a new benchmark consisting of 250 competition-style programming problems translated into Dafny. Unlike previous benchmarks that focused primarily on filling in missing proof hints, this new set requires the model to synthesize both the program logic and the necessary proof structure from scratch. The researchers also evaluated AxDafny on the existing DafnyBench, a standard for measuring proof-hint generation.
Key Results
AxDafny demonstrated significant improvements in verification success. On DafnyBench, the system achieved a 92.7% verification success rate, outperforming the strongest previously reported baseline by 6.5 percentage points without needing a manually curated library of proof hints. On the more challenging LCB-Pro-Dafny benchmark, AxDafny showed a substantial improvement over direct, non-iterative generation, successfully verifying 56.4% of tasks compared to the 11.6% success rate of the baseline model.
Important Considerations
The research highlights a distinction between "verification success" and "runtime performance." When the team compiled verified Dafny solutions into Python to run them against standard competitive programming tests, they found that many verified programs failed due to resource limits, such as exceeding time or memory constraints. This occurs because Dafny specifications focus on functional correctness—ensuring the code does what it is supposed to do—rather than enforcing efficiency or speed. Consequently, while AxDafny is highly effective at producing logically sound code, generating verified code that also meets strict performance requirements remains an open challenge.
Comments (0)
to join the discussion
No comments yet
Be the first to share your thoughts!