A$^{2}$utoLPBench: An Auto-Generated, Agent-Friendl... | AI Research

Key Takeaways

  • Most LP-from-text benchmarks are static datasets of word problems written and labeled by hand.
  • Once such a dataset is released, its size is fixed, its difficulty is fixed, and every problem can leak into the training data of future LLMs.
  • We present A$^{2}$utoLPBench, a benchmark for testing LLM-driven agents on linear programming problems written in plain text.
  • We first pick a feasible point and dual, then write down a problem for which that point is optimal and the objective value is known.
Paper Abstract

Most LP-from-text benchmarks are static datasets of word problems written and labeled by hand. Once such a dataset is released, its size is fixed, its difficulty is fixed, and every problem can leak into the training data of future LLMs. We present A$^{2}$utoLPBench, a benchmark for testing LLM-driven agents on linear programming problems written in plain text. We first pick a feasible point and dual, then write down a problem for which that point is optimal and the objective value is known. The answer is known by construction, with no solver call and no human annotator. The evaluation environment bundles a reference solver-critic baseline and a Docker image whose usage instructions are written for an LLM-driven agent to read. With these in place, any agent can run the benchmark and get a calibrated score with one command. Because the benchmark is a generator rather than a fixed dataset, it has properties no fixed dataset can match: an unlimited supply of fresh problems, a difficulty knob set by $(n,m)$, ground-truth answers correct by construction, low LLM-side cost per problem relative to human authoring, repeatable scores across independent batches, and resistance to training-data leakage when fresh post-cutoff seed ranges are used.

A$^{2}$utoLPBench: An Auto-Generated, Agent-Friendly LP Benchmark via Inverse-KKT Construction
A$^{2}$utoLPBench is a new framework for evaluating how well AI agents solve linear programming (LP) problems stated in plain text. Traditional benchmarks for this task rely on static, human-authored word problems, which are fixed in size and difficulty and susceptible to "data leakage," where models see the problems and their answers during training. This paper introduces a generator that produces an unlimited supply of fresh LP problems on demand, each with an answer that is correct by construction, providing a more robust and scalable way to test the problem-solving capabilities of LLM-driven agents.

How the Generator Works

Instead of collecting problems from human experts, the benchmark uses an "inverse-KKT" construction. The generator first picks a feasible point and dual variables, then works backward to write down a linear program for which that point is optimal and the objective value is known. Because the problem is built to satisfy the Karush–Kuhn–Tucker (KKT) optimality conditions at the chosen point, the correct answer is guaranteed by construction. No human annotator or solver call is needed to establish the ground truth, which also removes human transcription errors as a source of wrong labels.
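The idea can be illustrated with a minimal sketch. This is not the authors' generator: the LP form (minimize c·x subject to Ax ≥ b, x ≥ 0), the integer ranges, the reading of n as the number of variables and m as the number of constraints, and the function names `inverse_kkt_lp` and `check_kkt` are all assumptions made for illustration. The sketch picks a primal point and a dual vector, then chooses b and c so that primal feasibility, dual feasibility, and complementary slackness hold at that point.

```python
import random

def inverse_kkt_lp(n, m, seed):
    """Hypothetical sketch: build  min c.x  s.t.  A x >= b, x >= 0
    so that a pre-chosen point x_star is optimal."""
    rng = random.Random(seed)
    A = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(m)]
    # Chosen primal point and dual vector, each with some zero entries.
    x = [rng.randint(1, 9) * rng.randint(0, 1) for _ in range(n)]
    y = [rng.randint(1, 9) * rng.randint(0, 1) for _ in range(m)]
    # Complementary slackness: a constraint may be slack only if its
    # dual is zero; a reduced cost may be positive only if x_j is zero.
    s = [rng.randint(0, 4) if y[i] == 0 else 0 for i in range(m)]
    r = [rng.randint(0, 4) if x[j] == 0 else 0 for j in range(n)]
    b = [sum(A[i][j] * x[j] for j in range(n)) - s[i] for i in range(m)]
    c = [sum(A[i][j] * y[i] for i in range(m)) + r[j] for j in range(n)]
    value = sum(c[j] * x[j] for j in range(n))
    return {"A": A, "b": b, "c": c, "x_star": x, "y_star": y, "value": value}

def check_kkt(lp):
    """Verify primal feasibility, dual feasibility, and a zero duality gap."""
    A, b, c, x, y = lp["A"], lp["b"], lp["c"], lp["x_star"], lp["y_star"]
    m, n = len(A), len(c)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    ATy = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    primal = all(v >= 0 for v in x) and all(Ax[i] >= b[i] for i in range(m))
    dual = all(v >= 0 for v in y) and all(ATy[j] <= c[j] for j in range(n))
    gap = sum(c[j] * x[j] for j in range(n)) - sum(b[i] * y[i] for i in range(m))
    return primal and dual and gap == 0

lp = inverse_kkt_lp(n=4, m=3, seed=7)
```

In this sketch the optimal point need not be unique, but the optimal objective value is, so `lp["value"]` can serve as the ground truth. Integer data keeps the check exact, with no floating-point tolerance involved.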

Agent-Friendly Evaluation

The benchmark is designed to be "plug-and-play" for AI agents. It includes a bundled Docker environment that provides everything an agent needs to interact with the benchmark, including a manual that explains how to use available tools. The system also features a "solver-critic" baseline, which uses a two-agent protocol: one agent proposes a solution, and a second agent audits the work. If the critic finds an error, the solver can refine its approach. This setup allows researchers to measure not just whether an agent gets the right answer, but how well it can self-correct and use tools to solve complex mathematical tasks.
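The propose, audit, and refine cycle described above can be sketched as a short loop. Everything here is hypothetical: `solver_critic_loop`, the `solve` and `critique` callables, the round limit, and the toy stand-ins are illustrative assumptions, not the interface of the bundled baseline, where both roles would be played by LLM-driven agents.

```python
def solver_critic_loop(problem_text, solve, critique, max_rounds=3):
    """Hypothetical sketch of a two-agent protocol.

    solve(problem_text, feedback) returns a proposed answer.
    critique(problem_text, answer) returns None to accept the answer,
    or a feedback string describing the error it found.
    """
    feedback = None
    answer = None
    for round_no in range(1, max_rounds + 1):
        answer = solve(problem_text, feedback)
        feedback = critique(problem_text, answer)
        if feedback is None:
            return answer, round_no, True   # critic accepted the answer
    return answer, max_rounds, False        # round budget exhausted

# Toy stand-ins for the two agents, used only to exercise the loop.
def toy_solve(problem_text, feedback):
    return 10.0 if feedback is None else 12.0

def toy_critique(problem_text, answer):
    return None if answer == 12.0 else "objective value violates a constraint"

result = solver_critic_loop("maximize profit ...", toy_solve, toy_critique)
# -> (12.0, 2, True)
```

Returning the round count alongside the answer makes it possible to report how often an agent needed a correction, separately from whether its final answer was right.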

Key Advantages

Because A$^{2}$utoLPBench is a generator rather than a fixed dataset, it offers several structural advantages:

  • Unlimited Supply: Users can generate as many problems as needed, at any scale, without relying on a limited budget for human labor.

  • Tunable Difficulty: Difficulty is controlled by the problem dimensions $(n,m)$, which set the size of the constraint matrix, so researchers can test models across a spectrum of complexity.

  • Contamination Resistance: By using fresh random seeds, evaluators can create batches of problems that have never been seen by an AI model during its training phase, ensuring the benchmark remains a valid test of reasoning rather than memory.

  • Reproducibility: Independent batches of problems generated with the same parameters yield consistent scores, making it a reliable instrument for benchmarking progress.
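A minimal sketch of how these properties might fit together in an evaluation script: a batch is fully described by its dimensions and a seed range, so two batches with disjoint seed ranges share no instances. The names `InstanceSpec`, `make_batch`, and `accuracy`, and the tolerance used to compare objective values, are assumptions for illustration and not the benchmark's actual scoring rule.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical description of one generated problem."""
    n: int
    m: int
    seed: int

def make_batch(n, m, seed_start, count):
    """List the specs for `count` instances with consecutive seeds."""
    return [InstanceSpec(n, m, seed)
            for seed in range(seed_start, seed_start + count)]

def accuracy(reported, truth, tol=1e-6):
    """Fraction of reported objective values matching the ground truth."""
    hits = sum(math.isclose(r, t, rel_tol=tol, abs_tol=tol)
               for r, t in zip(reported, truth))
    return hits / len(truth)

# Two independent batches at the same difficulty, with disjoint seeds.
batch_a = make_batch(n=5, m=4, seed_start=1000, count=3)
batch_b = make_batch(n=5, m=4, seed_start=2000, count=3)
```

Comparing `accuracy` on `batch_a` and `batch_b` for the same agent is one way to check that scores repeat across independent batches.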

Practical Implementation

The authors provide a reference snapshot of 256 instances to demonstrate the system, but the core value lies in its flexibility. The benchmark is designed to scale to any size, allowing for long-term evaluation as AI models become more capable. By moving away from static corpora, the researchers have created a system that can evolve alongside AI technology, providing a consistent and rigorous way to measure the end-to-end problem-solving performance of LLM-driven agents.
