Paper Abstract

Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments -- are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence.

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

This paper provides a set of design principles for creating high-quality terminal-agent benchmarks. As AI models are increasingly tested on their ability to perform coding and system administration tasks, the author argues that current evaluation methods often fail because they treat benchmarks like prompts. While a prompt is designed to help an AI succeed, a benchmark should be designed to rigorously test if an AI can solve a problem independently. The paper outlines how to move away from common, flawed design patterns to create tasks that are truly adversarial, difficult, and legible.

The Problem with "Prompt-Style" Benchmarking

A major issue in current benchmark design is that authors often write instructions as if they are prompting an AI to complete a task. This leads to several predictable failure modes, such as over-prescriptive instructions that dictate every step, or "clerical" tasks that focus on formatting rather than conceptual problem-solving. When tasks are written this way, they often fail to measure an agent's actual capability, instead testing its ability to follow overly specific, hand-holding instructions. The author emphasizes that instructions should be written for a smart human—clear and concise, but not redundant or coercive.

Defining True Difficulty

The paper argues that real difficulty is conceptual, not environmental. A task should not be considered "hard" simply because it is resource-intensive, requires long wait times, or involves complex formatting. Instead, a good benchmark presents a genuine engineering challenge that requires the agent to reason, debug, and explore. The author suggests that the best tasks are derived from real-world problems that an experienced engineer would recognize. To ensure a task is effective, developers should test it against multiple models, observe where they fail, and determine if those failures stem from the task's inherent difficulty or from unfair, poorly designed constraints.

Preventing Reward Hacking

A critical concern for any benchmark is its resistance to "reward hacking," where an agent finds a way to pass a test without actually solving the underlying problem. The paper notes that over 15% of tasks in popular terminal-agent benchmarks are currently vulnerable to such exploits, ranging from simple output spoofing to modifying system libraries. To combat this, the author recommends that benchmark maintainers audit their environments for potential leaks and ensure that tests verify the final outcome rather than the specific implementation steps. By focusing on verifiable results, developers can create more robust and credible evaluations.

Practical Tips for Task Authors

To build better benchmarks, the author suggests a hands-on approach: developers should interact with their own containers, run the oracle solutions, and carefully analyze the logs of failing agents. By watching how agents struggle, authors can identify whether a task is truly challenging or simply broken. Furthermore, the author advocates for "literate programming" in task design—keeping instructions short, self-explanatory, and focused on the end goal. Ultimately, the goal is to create a benchmark that serves as a reliable, empirical signal of an AI's true capabilities in a real-world terminal environment.
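Part of that log analysis can be done mechanically. As one hedged sketch (the patterns and the helper name are invented for illustration, not part of any benchmark's tooling), a task author might grep agent transcripts for commands that touch the verification logic itself, which is a common reward-hacking signature:

```python
import re

# Illustrative patterns only: commands that read, edit, or write into a
# tests/ directory. Real transcripts and layouts will vary.
SUSPICIOUS = [
    r"\bcat\s+\S*tests?/",                 # reading the test files
    r"\b(vi|vim|nano|sed)\s+\S*tests?/",   # editing the test files
    r">\s*\S*tests?/",                     # redirecting output into them
]

def flag_suspicious_commands(transcript: str) -> list[str]:
    """Return transcript lines that match any reward-hacking pattern."""
    flagged = []
    for line in transcript.splitlines():
        if any(re.search(pat, line) for pat in SUSPICIOUS):
            flagged.append(line)
    return flagged
```

A filter like this only surfaces candidates; the author still has to read the flagged lines to decide whether the agent was legitimately exploring or gaming the verifier.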
