Back to AI Research

AI Research

AxDafny: Agentic Verified Code Generation in Dafny | AI Research

Key Takeaways

  • AxDafny: Agentic Verified Code Generation in Dafny This paper introduces AxDafny, an agentic framework designed to automate the generation of both executable...
  • We study agentic code generation in Dafny, where a model must generate both executable code and the proof artifacts for verification.
  • We present AxDafny, a verifier-guided repair framework that iteratively generates implementations, invariants, assertions, and termination arguments.
  • We also introduce LiveCodeBench-Pro-Dafny (LCB-Pro-Dafny), a benchmark of 250 competition-style programming problems translated into Dafny with formal specifications and a verifier-based evaluation harness.
  • On LCB-Pro-Dafny, AxDafny substantially improves verification success over baseline GPT-5.5 performance.
Paper AbstractExpand

We study agentic code generation in Dafny, where a model must generate both executable code and the proof artifacts for verification. We present AxDafny, a verifier-guided repair framework that iteratively generates implementations, invariants, assertions, and termination arguments. We also introduce LiveCodeBench-Pro-Dafny (LCB-Pro-Dafny), a benchmark of 250 competition-style programming problems translated into Dafny with formal specifications and a verifier-based evaluation harness. On LCB-Pro-Dafny, AxDafny substantially improves verification success over baseline GPT-5.5 performance. On DafnyBench, AxDafny achieves 92.7\% verification success, outperforming the strongest previously reported proof-hint baseline by 6.5 percentage points. Lastly, we show that verification success and runtime test performance measure different aspects of generated code.

AxDafny: Agentic Verified Code Generation in Dafny
This paper introduces AxDafny, an agentic framework designed to automate the generation of both executable code and the formal proof artifacts required to verify that code. While traditional AI coding assistants often rely on testing against a set of examples, AxDafny uses the Dafny programming language and its built-in verifier to ensure code is logically correct according to specific requirements. By iteratively generating implementations, invariants, and assertions, the system aims to bridge the gap between AI-generated code and formally verified software.

How AxDafny Works

AxDafny operates through an iterative "generate-check-repair" loop. A proposer agent generates a candidate solution, which is then evaluated by a reviewer system. This reviewer performs two types of checks: deterministic filters that ensure the code hasn't been "cheated" (such as by using shortcuts to bypass verification) and a verifier-based check that uses the Dafny compiler to identify specific logical errors or missing proof steps.
When verification fails, the system provides the model with diagnostic feedback, including the location of the error and the specific proof obligation that was not met. The agent then uses this information to refine its code. A memory module tracks the history of these attempts, allowing the model to learn from previous mistakes and maintain a "scratchpad" of lessons learned throughout the process.

Evaluating Verified Code

To test the framework, the authors introduced LCB-Pro-Dafny, a new benchmark consisting of 250 competition-style programming problems translated into Dafny. Unlike previous benchmarks that focused primarily on filling in missing proof hints, this new set requires the model to synthesize both the program logic and the necessary proof structure from scratch. The researchers also evaluated AxDafny on the existing DafnyBench, a standard for measuring proof-hint generation.

Key Results

AxDafny demonstrated significant improvements in verification success. On DafnyBench, the system achieved a 92.7% verification success rate, outperforming the strongest previously reported baseline by 6.5 percentage points without needing a manually curated library of proof hints. On the more challenging LCB-Pro-Dafny benchmark, AxDafny showed a substantial improvement over direct, non-iterative generation, successfully verifying 56.4% of tasks compared to the 11.6% success rate of the baseline model.

Important Considerations

The research highlights a distinction between "verification success" and "runtime performance." When the team compiled verified Dafny solutions into Python to run them against standard competitive programming tests, they found that many verified programs failed due to resource limits, such as exceeding time or memory constraints. This occurs because Dafny specifications focus on functional correctness—ensuring the code does what it is supposed to do—rather than enforcing efficiency or speed. Consequently, while AxDafny is highly effective at producing logically sound code, generating verified code that also meets strict performance requirements remains an open challenge.

Comments (0)

No comments yet

Be the first to share your thoughts!