Back to AI Research

AI Research

Using Aristotle API for AI-Assisted Theorem Proving... | AI Research

Key Takeaways

  • Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem This paper explores the capabilities and...
  • AI-assisted theorem proving can now generate substantial Lean developments for olympiad-level mathematics, but the evidential status of such developments depends on which declarations are actually verified.
  • This paper reports a Lean 4 formalization case study of an Aristotle API proof attempt for the Grasshopper problem, originally posed as IMO 2009 Problem 6.
  • The paper contributes a reproducible Lean artifact and a precise analysis of its verified and unverified proof content.
  • # Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem
Paper AbstractExpand

AI-assisted theorem proving can now generate substantial Lean developments for olympiad-level mathematics, but the evidential status of such developments depends on which declarations are actually verified. This paper reports a Lean 4 formalization case study of an Aristotle API proof attempt for the Grasshopper problem, originally posed as IMO 2009 Problem 6. The generated artifact states a generalized Lean version of the theorem, contains four verified helper lemmas for local components of a maximality and adjacent-swap exchange strategy, and leaves the main theorem grasshopper closed directly by one unresolved sorry. The verified components establish that the final partial sum equals the total sum, that an adjacent transposition can affect only the relevant intermediate partial sum, that the changed partial sum has the expected form, and that maximality at a position admitting an adjacent successor swap forces a corresponding forbidden-set membership fact. The Aristotle output summary identifies the intended remaining mathematical step as the global counting step needed to show that these membership facts produce at least n distinct forbidden values, contradicting the cardinality assumption |M| < n; the Lean source itself does not reduce the main theorem to a separately encoded counting lemma. This case study gives an inspectable example of a central limitation in AI-assisted formalization, namely that local proof search can succeed while the global combinatorial bookkeeping required for a theorem remains unresolved. The paper contributes a reproducible Lean artifact and a precise analysis of its verified and unverified proof content.

Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

This paper explores the capabilities and limitations of AI-assisted theorem proving by attempting to formalize a challenging combinatorial problem from the 2009 International Mathematical Olympiad (IMO) using the Aristotle API and the Lean 4 proof assistant. The study highlights the distinction between AI-generated code that is fully verified by a computer and code that relies on unresolved placeholders, providing a clear example of how local proof search can succeed while global mathematical reasoning remains incomplete.

The Challenge of Formalizing Olympiad Mathematics

The Grasshopper problem is a well-known, difficult combinatorial puzzle. It asks whether, given a set of jump lengths and a set of "forbidden" landing positions, one can order the jumps so that a grasshopper never lands on a forbidden spot. While human mathematicians can approach this using complex combinatorial arguments, teaching an AI to construct a rigorous, machine-checked proof in Lean 4 is significantly more difficult. The author uses this problem as a test case to see how well the Aristotle API can bridge the gap between informal mathematical intuition and formal, verified code.

Verified Components vs. Unresolved Proofs

The study reveals that the AI was successful in formalizing the "local" parts of the proof. Specifically, the Aristotle API generated four verified helper lemmas that handle the mechanics of the problem:

  • Confirming that the final jump lands at the expected total sum.

  • Proving that swapping two adjacent jumps only affects the specific intermediate landing position between them.

  • Calculating the exact value of that intermediate position after a swap.

  • Using a "maximality" argument to show that if a certain jump order is optimal, then landing on a forbidden spot forces other related positions to also be forbidden.
    Despite these successes, the main theorem remains unproven. In Lean, a proof can be "closed" using a command called sorry, which acts as a placeholder for an unfinished argument. The AI left the main theorem closed with this placeholder, meaning the computer has not actually verified the full solution.

The Global Reasoning Gap

The primary limitation identified in this case study is the gap between local calculations and global logic. To complete the proof, one would need to perform a "global counting argument"—a complex process of tracking how various forbidden values interact to show that they eventually contradict the problem's constraints.
The AI correctly identified that this counting step was the missing piece of the puzzle, but it was unable to translate that high-level strategy into the formal, step-by-step logic required by Lean. This illustrates a central challenge in modern AI-assisted mathematics: while AI can effectively handle small, isolated logical steps, it often struggles to synthesize these into the comprehensive, large-scale reasoning required to solve complex global problems.

Implications for Trustworthy AI

This research serves as a methodological warning for the field of AI-assisted formalization. It demonstrates that successful compilation of a Lean project does not automatically mean a theorem has been proven. Because the system allows for unresolved placeholders, researchers and users must carefully inspect the actual content of the generated code to distinguish between what has been rigorously verified and what remains an unproven assumption. The paper provides a transparent, reproducible example of this, showing that even when an AI generates a substantial amount of correct, verified code, the final result may still be incomplete.

Comments (0)

No comments yet

Be the first to share your thoughts!