
Key Takeaways

  • We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings.
  • From the full benchmark of 1,530 validated tasks, the authors curate a Verified subset of 200 representative, manually reviewed tasks.
  • Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%.
  • TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20).
Paper Abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at this https URL.

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
TerminalWorld is a new data engine designed to evaluate how well autonomous AI agents perform in real-world terminal environments. While existing benchmarks often rely on manually created "adversarial" puzzles that may not reflect actual developer work, TerminalWorld automatically turns thousands of authentic, human-recorded terminal sessions into standardized evaluation tasks. By doing so, it provides a scalable and evolving way to measure how AI agents handle the complex, multi-step workflows that developers encounter in their daily practice.

From Recordings to Benchmarks

The engine processes raw terminal session recordings from the public platform asciinema. Because these recordings are often noisy or lack clear instructions, the engine uses a multi-step automated process to refine them. First, it distills the developer's intent into a clear, outcome-oriented task instruction and extracts a clean reference solution. Second, it reverse-engineers the necessary software environment by building and refining Docker containers to ensure the task can be reliably reproduced. Finally, it uses a trial-based loop to generate and calibrate test suites, ensuring that the tasks are solvable, non-trivial, and capable of accurately judging an agent's success.
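The calibration step can be pictured with a small sketch. This is not the TerminalWorld implementation: the `Suite` type, the `calibrate` function, the dictionary-based "state" snapshots, and the file name `out/report.txt` are all hypothetical, and the acceptance rule shown (the reference solution must pass, the untouched starting environment must fail) is only one plausible reading of "solvable" and "non-trivial".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical model: an environment state is a mapping from file path to content.
State = Dict[str, str]
Check = Callable[[State], bool]


@dataclass
class Suite:
    """A candidate test suite: a name plus a list of checks over a state."""
    name: str
    checks: List[Check]


def passes(suite: Suite, state: State) -> bool:
    """A state passes a suite only if every check accepts it."""
    return all(check(state) for check in suite.checks)


def calibrate(candidates: List[Suite], initial: State, reference: State) -> Optional[Suite]:
    """Return the first candidate that the reference end state passes (solvable)
    and the untouched initial state fails (non-trivial), or None if none qualifies."""
    for suite in candidates:
        solvable = passes(suite, reference)
        non_trivial = not passes(suite, initial)
        if solvable and non_trivial:
            return suite
    return None


# Illustrative data only: an empty starting state and the state left by a reference solution.
initial_state: State = {}
reference_state: State = {"out/report.txt": "ok"}

candidates = [
    Suite("too_lenient", [lambda s: True]),                           # passes even with no work done
    Suite("too_strict", [lambda s: s.get("out/report.txt") == "OK"]), # rejects the reference solution
    Suite("calibrated", [lambda s: s.get("out/report.txt") == "ok"]),
]

chosen = calibrate(candidates, initial_state, reference_state)
print(chosen.name if chosen else "no suite accepted")
```

In this toy run the first candidate is rejected as trivial and the second as unsolvable, so the third is kept. A real engine would execute checks inside the rebuilt Docker container rather than against an in-memory dictionary.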

A Diverse and Scalable Dataset

By processing 80,870 raw recordings, the engine produced a full benchmark of 1,530 validated tasks. This collection covers 18 real-world categories (such as container orchestration and CI/CD pipelines) and includes 1,280 unique commands. Its command coverage is considerably broader than that of existing expert-curated benchmarks: 91% of these commands are absent from previous datasets like Terminal-Bench. Because the engine is automated, it can continue to grow as new recordings are uploaded, allowing the benchmark to evolve alongside changing developer practices.
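To put these figures in proportion, the short calculation below derives the recording-to-task yield and the approximate number of commands not seen in earlier datasets. Only the four input numbers come from the text; the variable names are illustrative, and because the 91% figure is itself rounded, the derived command count is an estimate.

```python
# Figures quoted in the text.
recordings = 80_870        # raw terminal recordings processed
validated_tasks = 1_530    # tasks in the full benchmark
unique_commands = 1_280    # distinct commands covered
absent_share = 0.91        # share of commands absent from earlier datasets (rounded)

# Fraction of recordings that survive as validated tasks.
yield_rate = validated_tasks / recordings

# Average number of recordings consumed per validated task.
recordings_per_task = recordings / validated_tasks

# Approximate count of commands not found in earlier datasets.
absent_commands = round(unique_commands * absent_share)

print(f"yield rate: {yield_rate:.2%}")
print(f"recordings per validated task: {recordings_per_task:.1f}")
print(f"commands absent from earlier datasets: about {absent_commands}")
```

Under these numbers, fewer than 2% of recordings end up as validated tasks, which indicates how aggressively the pipeline filters noisy sessions.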

Performance of Current AI Agents

The researchers tested eight frontier models and six leading terminal agents on TerminalWorld-Verified, the manually reviewed subset of 200 tasks. The results show that even the most advanced systems struggle with authentic terminal workflows: the best pass rate achieved was only 62.5%. The study also found that scores on the existing expert-curated benchmark Terminal-Bench correlate only weakly with performance on TerminalWorld (Pearson r=0.20), suggesting that existing tests do not fully capture the skills required for real-world terminal tasks. Furthermore, the data shows that agents often solve these tasks using different command paths than the original human developers, highlighting the flexibility of AI in navigating complex software environments.
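For readers who want to see how the two headline numbers are computed, here is a minimal sketch. The `pearson_r` helper is a textbook implementation of the Pearson correlation coefficient, not code from the paper, and the two score lists are made-up placeholders chosen to be perfectly linear so the function can be sanity-checked; they are not results from TerminalWorld or Terminal-Bench.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# A 62.5% pass rate on the 200-task Verified subset corresponds to this many solved tasks.
verified_tasks = 200
best_pass_rate = 0.625
solved = round(verified_tasks * best_pass_rate)
print(f"tasks solved by the best system: {solved}")

# Placeholder scores (NOT real data): one value per hypothetical model on two benchmarks.
placeholder_bench_a = [10.0, 20.0, 30.0, 40.0]
placeholder_bench_b = [20.0, 40.0, 60.0, 80.0]
sanity_r = pearson_r(placeholder_bench_a, placeholder_bench_b)
print(f"sanity check on perfectly linear data: r = {sanity_r:.2f}")
```

A value near 1 means one benchmark's ranking largely predicts the other's; the reported r=0.20 is far from that, which is why the two benchmarks are described as measuring different capabilities.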
