A$^{2}$utoLPBench: An Auto-Generated, Agent-Friendly LP Benchmark via Inverse-KKT Construction
A$^{2}$utoLPBench is a framework for evaluating how well AI agents solve linear programming (LP) problems. Traditional benchmarks for this task rely on static, human-authored word problems, which are limited in size, prone to becoming outdated, and susceptible to "data leakage," where models memorize the answers during training. The paper introduces a dynamic system that generates fresh LP problems on demand, each with an answer that is correct by construction, providing a more robust and scalable way to test the problem-solving capabilities of LLM-driven agents.
How the Generator Works
Instead of collecting problems from human experts, the benchmark uses an "inverse-KKT" construction method. The system starts by picking a known optimal point and dual variables, then works backward to derive the corresponding linear programming problem. Because the problem is built around a solution that satisfies the Karush–Kuhn–Tucker (KKT) conditions, which are both necessary and sufficient for optimality in a linear program, the correct answer is guaranteed by construction. This eliminates the need for human annotators or external solver calls to verify results, and it keeps the ground truth free from human transcription errors.
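The backward construction can be illustrated with a minimal sketch. It assumes the inequality form "minimize c·x subject to A x <= b", which may differ from the form the paper actually uses. The names `generate_lp` and `kkt_holds`, the integer ranges, and the 50% chance of a constraint being active are all hypothetical choices made for illustration. Integers are used so that every KKT condition can be checked exactly.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generate_lp(n_vars, n_cons, seed):
    """Build 'min c.x s.t. A x <= b' around a chosen optimum (illustrative only)."""
    rng = random.Random(seed)
    # 1. Pick the optimal point and a random constraint matrix.
    x_star = [rng.randint(-5, 5) for _ in range(n_vars)]
    A = [[rng.randint(-9, 9) for _ in range(n_vars)] for _ in range(n_cons)]
    # 2. Decide which constraints are active (tight) at x_star.
    active = [rng.random() < 0.5 for _ in range(n_cons)]
    # 3. Complementary slackness: active rows get a positive multiplier and
    #    zero slack; inactive rows get a zero multiplier and positive slack.
    y = [rng.randint(1, 5) if is_active else 0 for is_active in active]
    slack = [0 if is_active else rng.randint(1, 5) for is_active in active]
    # 4. Primal feasibility: b = A x_star + slack.
    b = [dot(row, x_star) + s for row, s in zip(A, slack)]
    # 5. Stationarity: c + A^T y = 0, so c = -A^T y.
    c = [-sum(A[i][j] * y[i] for i in range(n_cons)) for j in range(n_vars)]
    return {"A": A, "b": b, "c": c, "x_star": x_star, "y": y,
            "opt": dot(c, x_star)}

def kkt_holds(lp):
    """Check the four KKT conditions exactly (all data are integers)."""
    A, b, c, x, y = lp["A"], lp["b"], lp["c"], lp["x_star"], lp["y"]
    residual = [bi - dot(row, x) for row, bi in zip(A, b)]
    stationary = all(
        c[j] + sum(A[i][j] * y[i] for i in range(len(A))) == 0
        for j in range(len(c))
    )
    return (all(r >= 0 for r in residual)                        # primal feasibility
            and all(yi >= 0 for yi in y)                         # dual feasibility
            and stationary                                       # stationarity
            and all(yi * r == 0 for yi, r in zip(y, residual)))  # complementarity

lp = generate_lp(n_vars=4, n_cons=6, seed=0)
print(kkt_holds(lp), lp["opt"])
```

In this sketch the optimal value is certified without a solver, although the optimal point itself need not be unique. A grader built this way would therefore compare objective values rather than points.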
Agent-Friendly Evaluation
The benchmark is designed to be "plug-and-play" for AI agents. It includes a bundled Docker environment that provides everything an agent needs to interact with the benchmark, including a manual that explains how to use available tools. The system also features a "solver-critic" baseline, which uses a two-agent protocol: one agent proposes a solution, and a second agent audits the work. If the critic finds an error, the solver can refine its approach. This setup allows researchers to measure not just whether an agent gets the right answer, but how well it can self-correct and use tools to solve complex mathematical tasks.
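The propose, audit, and refine loop described above can be sketched as follows. This is an assumption-laden outline, not the benchmark's actual protocol: `solve_with_critic`, `Outcome`, the `max_rounds` cap, and the two toy agents are hypothetical stand-ins for LLM-backed components.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Outcome:
    answer: float
    rounds: int
    accepted: bool

def solve_with_critic(
    propose: Callable[[Optional[str]], float],
    audit: Callable[[float], Optional[str]],
    max_rounds: int = 3,
) -> Outcome:
    """Alternate solver proposals and critic audits until the critic accepts."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    answer = 0.0
    for rounds in range(1, max_rounds + 1):
        answer = propose(feedback)   # solver sees the critic's last objection
        feedback = audit(answer)     # None means the critic found no error
        if feedback is None:
            return Outcome(answer, rounds, True)
    return Outcome(answer, max_rounds, False)

# Toy stand-ins for the toy problem "maximize x subject to x <= 4".
def toy_propose(feedback: Optional[str]) -> float:
    return 5.0 if feedback is None else 4.0

def toy_audit(answer: float) -> Optional[str]:
    return None if answer <= 4.0 else "constraint x <= 4 is violated"

result = solve_with_critic(toy_propose, toy_audit)
print(result)  # Outcome(answer=4.0, rounds=2, accepted=True)
```

Recording the number of rounds alongside the final answer is what lets an evaluation separate first-attempt accuracy from the ability to self-correct.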
Key Advantages
Because A$^{2}$utoLPBench is a generator rather than a fixed dataset, it offers several structural advantages:
Unlimited Supply: Users can generate as many problems as needed, at any scale, without relying on a limited budget for human labor.
Tunable Difficulty: The difficulty of the problems can be adjusted by changing the dimensions of the matrices involved, allowing researchers to test models across a spectrum of complexity.
Contamination Resistance: By using fresh random seeds, evaluators can create batches of problems that have never been seen by an AI model during its training phase, ensuring the benchmark remains a valid test of reasoning rather than memory.
Reproducibility: Independent batches of problems generated with the same parameters yield consistent scores, making it a reliable instrument for benchmarking progress.
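To make the seed and dimension knobs concrete, here is a small sketch of how a batch specification might map to per-instance seeds. `BatchSpec`, `instance_seeds`, and the hashing scheme are hypothetical and are not taken from the benchmark's code. The sketch only shows one way to get batches that are exactly repeatable for a given seed and fresh for a new one.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class BatchSpec:
    n_vars: int      # number of decision variables (difficulty knob)
    n_cons: int      # number of constraints (difficulty knob)
    size: int        # how many instances to generate
    batch_seed: int  # change this to obtain a fresh, unseen batch

def instance_seeds(spec: BatchSpec) -> list[int]:
    """Derive one deterministic 64-bit seed per instance from the batch spec."""
    seeds = []
    for idx in range(spec.size):
        key = f"{spec.batch_seed}:{spec.n_vars}:{spec.n_cons}:{idx}".encode()
        digest = hashlib.sha256(key).digest()
        seeds.append(int.from_bytes(digest[:8], "big"))
    return seeds

spec = BatchSpec(n_vars=8, n_cons=12, size=4, batch_seed=2024)
fresh = BatchSpec(n_vars=8, n_cons=12, size=4, batch_seed=2025)

print(instance_seeds(spec) == instance_seeds(spec))          # same spec, same batch
print(set(instance_seeds(spec)) & set(instance_seeds(fresh)))  # overlap between batches
```

Each derived seed would then be passed to the instance generator, so publishing the specification is enough for another group to regenerate the same batch.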
Practical Implementation
The authors provide a reference snapshot of 256 instances to demonstrate the system, but the core value lies in its flexibility. The benchmark is designed to scale to any size, allowing for long-term evaluation as AI models become more capable. By moving away from static corpora, the researchers have created a system that can evolve alongside AI technology, providing a consistent and rigorous way to measure the end-to-end problem-solving performance of LLM-driven agents.