AgentEscapeBench: Evaluating Out-of-Domain Tool-Gro...

Key Takeaways

  • AgentEscapeBench is a new diagnostic benchmark designed to test how well AI agents can perform complex, multi-step reasoning when using external tools.
  • As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions.
  • We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints.
  • AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation.
  • Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation.
Paper Abstract

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.

AgentEscapeBench is a new diagnostic benchmark designed to test how well AI agents can perform complex, multi-step reasoning when using external tools. Unlike existing benchmarks that often rely on familiar tasks—such as booking travel or writing code—this framework places agents in "escape room" scenarios. These tasks require agents to navigate unfamiliar environments, interpret narrative clues, and manage long chains of dependencies where one tool's output is required for the next. The goal is to measure an agent’s ability to reason and adapt in novel situations rather than simply repeating learned patterns.

How the Benchmark Works

The benchmark uses an automated pipeline to generate 270 unique tasks across five difficulty levels. Each task is structured as a directed acyclic graph (DAG), where nodes represent tools or items. To succeed, an agent must:

  • Extract clues: Read narratives to understand how to use unfamiliar tools.

  • Manage state: Track which items have been discovered and which tools are ready to be used.

  • Propagate results: Correctly pass the output of one tool into the input of the next.

  • Solve the puzzle: Reach a final, deterministically verifiable answer.

The system uses an incremental-disclosure mechanism: each time an agent successfully uses a tool, the environment reveals new items or tools, forcing the agent to keep revising its plan as it learns more about the environment.
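
The structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual code: the class names (`Tool`, `EscapeRoom`), the three demo tools, and the rule that a tool is disclosed once all of its input items are known are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Tool:
    """Hypothetical tool node: consumes known items, produces one new item."""
    name: str
    requires: List[str]        # item names that must be discovered first
    produces: str              # item name revealed when the call succeeds
    fn: Callable[..., str]     # the function that is actually invoked


@dataclass
class EscapeRoom:
    tools: Dict[str, Tool]
    items: Dict[str, str]                            # discovered items: name -> value
    visible: Set[str] = field(default_factory=set)   # tools disclosed so far

    def __post_init__(self) -> None:
        self._disclose()

    def _disclose(self) -> None:
        # Incremental disclosure (assumed rule): a tool becomes visible
        # only once every item it requires has been discovered.
        for tool in self.tools.values():
            if all(r in self.items for r in tool.requires):
                self.visible.add(tool.name)

    def call(self, tool_name: str, *args: str) -> str:
        if tool_name not in self.visible:
            raise PermissionError(f"{tool_name} is not available yet")
        tool = self.tools[tool_name]
        result = tool.fn(*args)            # the agent supplies the arguments
        self.items[tool.produces] = result
        self._disclose()                   # new tools may now be revealed
        return result


def build_demo_room() -> EscapeRoom:
    # A three-step dependency chain: note -> code -> key -> answer.
    tools = {
        "decode": Tool("decode", ["note"], "code", lambda note: note[::-1]),
        "open_safe": Tool("open_safe", ["code"], "key", lambda code: f"key-{code}"),
        "unlock_door": Tool("unlock_door", ["key"], "answer", lambda key: key.upper()),
    }
    return EscapeRoom(tools=tools, items={"note": "3141"})


def solve(room: EscapeRoom, goal: str = "answer") -> str:
    """Trivial oracle agent: call each newly visible tool with tracked items."""
    used: Set[str] = set()
    while goal not in room.items:
        ready = sorted(room.visible - used)
        if not ready:
            raise RuntimeError("stuck: no tool is ready")
        tool = room.tools[ready[0]]
        # Result propagation: earlier outputs become the next call's inputs.
        room.call(tool.name, *[room.items[r] for r in tool.requires])
        used.add(tool.name)
    return room.items[goal]


if __name__ == "__main__":
    print(solve(build_demo_room()))  # KEY-1413
```

The oracle agent here always passes the correct tracked value. An LLM agent has to do the same bookkeeping from narrative clues and its own context, and the chain breaks as soon as one intermediate value is dropped or replaced by a guess.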

Key Findings and Performance

The study evaluated sixteen different AI models alongside human participants. The results show a clear "reasoning gap" between humans and machines:

  • Performance Drop: Success fell for both groups as dependency depth increased, but more steeply for models. Humans declined from 98.3% at difficulty-5 to 80.0% at difficulty-25 (18.3 points), while the best-performing model dropped from 90.0% to 60.0% (30.0 points).

  • The Chaining Bottleneck: The primary failure point for AI agents is not the individual tool call but chaining multiple steps together. Even when models could solve individual parts of a puzzle, they struggled to maintain the correct sequence of operations over long dependency chains.

  • Reasoning Models: Interestingly, models specifically designed for "reasoning" or extended deliberation did not consistently outperform standard chat models. This suggests that the current bottleneck is not just internal logic, but the ability to ground that logic in real-time, external tool interactions.
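
To make the size of the gap concrete, the success rates quoted in the first bullet can be turned into absolute and relative drops. The helper name `drop` is made up for this example, and the relative figures are derived here from the reported rates rather than taken from the paper.

```python
from typing import Tuple


def drop(start: float, end: float) -> Tuple[float, float]:
    """Return (percentage-point drop, relative drop in %) between two success rates."""
    points = round(start - end, 1)
    relative = round(100 * (start - end) / start, 1)
    return points, relative


if __name__ == "__main__":
    print("humans:    ", drop(98.3, 80.0))   # (18.3, 18.6)
    print("best model:", drop(90.0, 60.0))   # (30.0, 33.3)
```

On these numbers the best model loses about a third of its easiest-tier success rate, compared with just under a fifth for human participants.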

Diagnostic Insights

By analyzing the interaction logs, the researchers identified specific failure signatures that explain why models struggle. As tasks grow more complex, models increasingly suffer from "premature invocation," where they attempt to use tools before the necessary upstream data is available. Additionally, many models struggle with "clue adherence," often resorting to guessing parameter values rather than correctly propagating the results they have already generated. These findings suggest that while modern AI agents are becoming proficient at local, short-range tool use, they still lack the robust, long-range planning capabilities required for truly general-purpose, autonomous work.
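
A check of this kind can be sketched as a pass over an interaction log. The log format, the `diagnose` function, and the example tool names below are assumptions made for illustration and are not the researchers' analysis code. A call counts as premature if some upstream tool has not yet succeeded, and as guessed if it passes an argument that was never observed as an initial value or an earlier output.

```python
from typing import Dict, List, Optional, Set, TypedDict


class Call(TypedDict):
    tool: str
    args: List[str]
    output: Optional[str]   # None means the call failed


def diagnose(
    deps: Dict[str, List[str]],      # tool -> upstream tools it depends on
    initial_values: List[str],       # values the agent was given at the start
    trajectory: List[Call],
) -> Dict[str, List[int]]:
    """Flag step indices that look like premature invocation or guessed arguments."""
    succeeded: Set[str] = set()
    known: Set[str] = set(initial_values)
    premature: List[int] = []
    guessed: List[int] = []
    for step, call in enumerate(trajectory):
        if any(d not in succeeded for d in deps.get(call["tool"], [])):
            premature.append(step)           # upstream data not available yet
        elif any(a not in known for a in call["args"]):
            guessed.append(step)             # argument was never produced or given
        if call["output"] is not None:
            succeeded.add(call["tool"])
            known.add(call["output"])        # outputs become valid future inputs
    return {"premature": premature, "guessed": guessed}


if __name__ == "__main__":
    deps = {"decode": [], "open_safe": ["decode"]}
    log: List[Call] = [
        {"tool": "open_safe", "args": ["0000"], "output": None},        # too early
        {"tool": "decode", "args": ["3141"], "output": "1413"},
        {"tool": "open_safe", "args": ["1314"], "output": None},        # guessed value
        {"tool": "open_safe", "args": ["1413"], "output": "key-1413"},  # propagated
    ]
    print(diagnose(deps, ["3141"], log))  # {'premature': [0], 'guessed': [2]}
```

Exact string matching is a deliberate simplification. A real analysis would also need to handle arguments that are legitimately derived from earlier outputs rather than copied verbatim.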
