The First Proof Second Batch project aims to provide a transparent, rigorous assessment of how current AI systems perform on research-level mathematical problems. Using a set of ten unpublished problems contributed by active researchers, the project seeks to move beyond informal testing and establish a formal benchmark for evaluating AI capabilities in mathematical reasoning and proof generation.
Methodology and Transparency
The project emphasizes transparency and reproducibility. Between March and June 2026, the organizers curated ten problems across diverse fields such as stochastic partial differential equations, discrete geometry, and von Neumann algebras. To ensure the problems had not already been solved in public forums, the organizers sourced them directly from the contributors' own research.
The testing process was strictly controlled: the organizers ran all AI systems on their own cloud infrastructure to ensure consistent conditions. Four systems were tested (OpenAI’s ChatGPT 5.5 Pro and three academic harnesses) under a "one-shot" requirement, meaning each system had to produce a complete proof with no further human interaction.
Grading and Expert Review
To evaluate the AI-generated solutions, the project employed a double-blind peer review model similar to that of academic journals. Thirty expert mathematicians served as referees, grading each submission based on mathematical correctness, novelty, and the quality of exposition. Solutions were categorized as "essentially flawless," "requiring minor revisions," "requiring major revisions," or "rejected." This formal assessment allowed the team to distinguish between AI outputs that were genuinely insightful and those that merely mimicked mathematical language.
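As a rough illustration of the four-category scale described above, the sketch below encodes the verdicts and counts which problems received at least one acceptable submission. It is not the project's actual tooling: the names `Verdict`, `PASSING`, and `problems_with_a_pass` are hypothetical, the sample grades are made up, and treating "essentially flawless" and "requiring minor revisions" as the passing categories is an assumption, since the text does not say where the line was drawn.

```python
from enum import Enum


class Verdict(Enum):
    """The four referee categories named in the text."""
    ESSENTIALLY_FLAWLESS = "essentially flawless"
    MINOR_REVISIONS = "requiring minor revisions"
    MAJOR_REVISIONS = "requiring major revisions"
    REJECTED = "rejected"


# Assumption: only the top two categories count as a passing grade.
PASSING = {Verdict.ESSENTIALLY_FLAWLESS, Verdict.MINOR_REVISIONS}


def problems_with_a_pass(grades):
    """Return the sorted ids of problems with at least one passing verdict.

    `grades` maps a problem id to the list of verdicts its submissions received.
    """
    return sorted(
        problem
        for problem, verdicts in grades.items()
        if any(v in PASSING for v in verdicts)
    )


# Made-up example data, not results from the study.
example = {
    "P1": [Verdict.REJECTED, Verdict.MINOR_REVISIONS],
    "P2": [Verdict.MAJOR_REVISIONS, Verdict.REJECTED],
    "P3": [Verdict.ESSENTIALLY_FLAWLESS],
}

print(problems_with_a_pass(example))  # ['P1', 'P3']
```

Keeping the passing set as a separate constant makes the threshold easy to change, for example to check how a summary count shifts if "requiring major revisions" were also treated as partial success.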
Key Findings
The results were mixed. Seven of the ten problems received at least one passing grade, with some AI-generated solutions described as essentially publishable. Notably, one system produced a novel approach to a stochastic PDE problem that impressed the referees.
However, the study also highlighted significant limitations. AI systems struggled with problems that did not have clear analogues in existing literature, and in one instance (metric geometry), no system made substantial progress. A recurring issue was the tendency for AI to handle routine steps with excessive detail while glossing over critical logical gaps, sometimes citing non-existent papers or failing to attribute work properly. The researchers noted that while academic teams could improve the quality of AI outputs through specialized harnesses, these improvements often came at a high financial cost.
Future Directions
The First Proof project views this benchmark as an ongoing effort. Following the formal results, the team plans to launch a community experiment in the summer of 2026, allowing the public to test AI systems on a smaller set of problems. A third, more comprehensive batch of problems is scheduled for development between August and October 2026, continuing the project's mission to inform both the mathematical community and the public about the evolving capabilities of AI in research.