AI Research

AutoLab: Can Frontier Models Solve Long-Horizon Aut...

Key Takeaways

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts.
AutoLab is a new benchmark for ultra long-horizon closed-loop optimization, introduced because existing benchmarks mostly evaluate single-turn responses or short-horizon agent trajectories.
AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization.
Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget.

Paper Abstract

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.

Scientific and engineering progress is rarely the result of a single, perfect idea; it is an iterative process of proposing changes, testing them, and refining the results. While current AI models excel at short, one-off tasks, they often struggle with long-horizon projects that require sustained, hours-long effort. The researchers behind AutoLab have introduced a new benchmark designed to test whether frontier AI models can handle these complex, multi-step research and engineering challenges.

The AutoLab Benchmark

AutoLab consists of 36 expert-curated tasks across four domains: system optimization, puzzle and challenge, model development, and CUDA kernel optimization. Unlike standard benchmarks that look for a single correct answer, AutoLab provides a working but intentionally suboptimal baseline. The AI agent is then given a strict wall-clock budget—ranging from two to twelve hours—to iteratively improve that baseline. To ensure the benchmark remains rigorous, it uses "sealed" evaluators and correctness gates to prevent models from "hacking" the system or taking shortcuts to achieve a high score.
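The idea of a correctness gate can be illustrated with a minimal sketch. This is not AutoLab's evaluation harness, which the summary does not describe in detail; the function `gated_speedup`, the toy sorting task, and the scoring rule (speedup over the baseline) are all assumptions made for illustration. The point is only that a candidate earns no credit for being fast unless it first reproduces the baseline's outputs.

```python
import time
from typing import Callable, List, Sequence

Solver = Callable[[List[int]], List[int]]


def slow_sort(values: List[int]) -> List[int]:
    """Correct but deliberately suboptimal baseline (bubble sort)."""
    out = list(values)
    for i in range(len(out)):
        for j in range(len(out) - 1 - i):
            if out[j] > out[j + 1]:
                out[j], out[j + 1] = out[j + 1], out[j]
    return out


def gated_speedup(candidate: Solver, baseline: Solver,
                  inputs: Sequence[List[int]], repeats: int = 3) -> float:
    """Hypothetical scorer: baseline_time / candidate_time, or 0.0 if gated out."""
    # Correctness gate: any output mismatch means the candidate scores nothing,
    # no matter how fast it runs.
    for data in inputs:
        if candidate(list(data)) != baseline(list(data)):
            return 0.0

    def best_time(fn: Solver) -> float:
        # Take the minimum over several runs to reduce timing noise.
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            for data in inputs:
                fn(list(data))
            timings.append(time.perf_counter() - start)
        return min(timings)

    # Guard against a zero reading on very coarse timers.
    return best_time(baseline) / max(best_time(candidate), 1e-9)


inputs = [[5, 3, 1, 4, 2], [9, 8, 7], list(range(200, 0, -1))]

# A "shortcut" that skips the work is rejected by the gate.
rejected = gated_speedup(lambda v: list(v), slow_sort, inputs)

# A genuinely correct candidate is timed and scored.
accepted = gated_speedup(sorted, slow_sort, inputs)
```

In this toy setup the evaluator is "sealed" only in the sense that the candidate never sees or modifies the scoring code; a real harness would need stronger isolation than a function boundary.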

Why Persistence Matters

The researchers evaluated 17 state-of-the-art models under these conditions. The central finding is that the quality of an agent's initial attempt is not the dominant predictor of its final success. Persistence is: the top-performing models repeatedly benchmarked their progress, edited their code, and incorporated empirical feedback throughout the entire duration of the task.

Key Findings and Limitations

The study found that while some models, such as claude-opus-4.6, show strong capabilities in long-horizon optimization, most other frontier models, including several proprietary ones, struggle. A common issue is a lack of "time awareness": some models terminate their work prematurely, while others exhaust their entire time budget with minimal progress. Even highly capable models often fail to make meaningful progress because they do not sustain the iterative loop that complex engineering requires.
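As a rough sketch of what "time awareness" might look like in an agent loop, the example below keeps iterating until another step would cut into a reserved slice of the budget, and it always retains the best valid candidate seen so far. None of this comes from the paper: `optimize_within_budget`, `StepClock`, the `reserve_s` parameter, and the toy objective are hypothetical names and choices made for illustration.

```python
import time
from typing import Callable, Optional, Tuple


def optimize_within_budget(
    propose: Callable[[float], float],
    evaluate: Callable[[float], Optional[float]],
    baseline: float,
    budget_s: float,
    reserve_s: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float, int]:
    """Hypothetical loop: returns (best candidate, best score, iterations run)."""
    start = clock()
    best = baseline
    best_score = evaluate(baseline)
    if best_score is None:
        raise ValueError("baseline must pass the evaluator")
    iterations = 0
    longest_step = 0.0
    while True:
        now = clock()
        remaining = budget_s - (now - start)
        # Time awareness: stop once a step as slow as the slowest one seen
        # would eat into the time reserved for wrapping up.
        if remaining - longest_step <= reserve_s:
            break
        candidate = propose(best)
        score = evaluate(candidate)  # None means the candidate failed validation
        iterations += 1
        longest_step = max(longest_step, clock() - now)
        # Persistence with feedback: keep a change only if measurement says it helps.
        if score is not None and score > best_score:
            best, best_score = candidate, score
    return best, best_score, iterations


class StepClock:
    """Deterministic stand-in clock for the demo: advances 1 second per call."""

    def __init__(self) -> None:
        self.t = 0.0

    def __call__(self) -> float:
        self.t += 1.0
        return self.t


# Toy objective (higher is better) with its optimum at x = 5.
result = optimize_within_budget(
    propose=lambda x: x + 1.0,
    evaluate=lambda x: -(x - 5.0) ** 2,
    baseline=0.0,
    budget_s=10.0,
    reserve_s=2.0,
    clock=StepClock(),
)
```

The two failure modes described above correspond to the two ways this loop can go wrong: breaking out long before the budget is near its end, or running to the end without the measured score ever improving on the baseline.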

Moving Toward Capable Agents

The AutoLab project highlights that long-horizon optimization is a distinct skill that cannot be reduced to simple coding ability. By open-sourcing the benchmark, the researchers aim to shift the focus of AI development toward agents that can manage compute, time, and noisy data over extended periods. The findings suggest that for AI to become a true partner in scientific and engineering research, future development must prioritize persistence and the ability to learn from empirical, iterative feedback.
