Holistic Evaluation and Failure Diagnosis of AI Agents

Key Takeaways

  • AI agents are increasingly used for complex, multi-step tasks, but current methods for evaluating them are limited.
  • We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments.
  • This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict.
  • Per-category analysis shows our framework leading in more error categories than any other evaluator.
Paper Abstract

AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.

Holistic Evaluation and Failure Diagnosis of AI Agents
AI agents are increasingly used for complex, multi-step tasks, but current methods for evaluating them are limited. Most existing benchmarks only report whether an agent succeeded or failed, leaving developers in the dark about why a failure occurred or where in the process it happened. This paper introduces a new framework that breaks down agent execution into individual, manageable parts to provide a precise, diagnostic view of agent performance.

A Two-Pronged Approach

The framework combines two perspectives to create a complete picture of agent behavior. The "bottom-up" approach evaluates individual units of activity—such as a single tool invocation or an LLM call—to pinpoint exactly where an error occurred and categorize its cause. The "top-down" approach looks at the agent’s overall performance, assessing high-level patterns like planning quality, tool efficiency, and whether the agent’s actions actually move it toward its goal. By combining these, the framework can identify both specific technical mistakes and broader, systemic issues that span multiple steps.
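As a rough illustration of how the two perspectives might be combined, the sketch below runs a toy span-level check on every span and a toy agent-level check on the whole trace, then merges both into one report. The `Span` and `SpanVerdict` types, the `tool_error` category name, and the keyword heuristics are all assumptions made for this example. They stand in for the model-based judges the paper describes and are not its actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    kind: str    # assumed labels: "tool_call" or "llm_call"
    name: str    # tool or model name
    output: str


@dataclass
class SpanVerdict:
    span_id: str
    category: Optional[str]  # None means no error was found in this span
    rationale: str


def bottom_up(span: Span) -> SpanVerdict:
    """Toy span-level check: flag tool calls whose output reports an error."""
    if span.kind == "tool_call" and "error" in span.output.lower():
        return SpanVerdict(span.span_id, "tool_error", f"{span.name} reported an error")
    return SpanVerdict(span.span_id, None, "no issue detected")


def top_down(trace: list[Span]) -> dict:
    """Toy agent-level check: treat repeated use of the same tool as inefficiency."""
    tool_names = [s.name for s in trace if s.kind == "tool_call"]
    return {
        "tool_calls": len(tool_names),
        "repeated_tool_calls": len(tool_names) - len(set(tool_names)),
    }


def evaluate(trace: list[Span]) -> dict:
    """Combine per-span verdicts with the agent-level summary."""
    return {
        "span_verdicts": [bottom_up(s) for s in trace],
        "agent_level": top_down(trace),
    }


trace = [
    Span("s1", "llm_call", "planner", "Plan: search, then summarize."),
    Span("s2", "tool_call", "search", "Error: request timed out"),
    Span("s3", "tool_call", "search", "3 results found"),
]
report = evaluate(trace)
```

Here the span-level pass pins the failure to `s2`, while the agent-level pass separately notes that `search` was called twice, a pattern no single span reveals on its own.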

Solving the Long-Trace Problem

A major challenge in evaluating AI agents is that their execution traces can become very long. Traditional "monolithic" judges, which try to analyze the entire trace at once, tend to lose focus or exceed their context limits. The new framework addresses this by decomposing the trace into independent, per-span assessments. Because each evaluation covers only a small, specific part of the process, the input to any single assessment stays small no matter how long the full trace grows. This method also lets the system provide a natural language rationale for every verdict, making it easier for developers to understand the "why" behind a failure.
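The decomposition idea can be sketched in a few lines. In this minimal example, the judge is called once per span and never sees the rest of the trace, so the size of each call depends on the span rather than on the trace length. The `stub_judge` function is a hypothetical stand-in for an LLM call, and the `Verdict` shape is an assumption, not the paper's output format.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    span_index: int
    has_error: bool
    rationale: str


def stub_judge(span_text: str) -> tuple[bool, str]:
    """Hypothetical stand-in for an LLM judge that reads a single span."""
    if "Traceback" in span_text:
        return True, "span output contains a Python traceback"
    return False, "no problem found in this span"


def evaluate_spans(
    spans: list[str],
    judge: Callable[[str], tuple[bool, str]] = stub_judge,
) -> list[Verdict]:
    """Assess each span independently; every verdict carries its own rationale."""
    verdicts = []
    for index, text in enumerate(spans):
        has_error, rationale = judge(text)  # the judge sees only this one span
        verdicts.append(Verdict(index, has_error, rationale))
    return verdicts


# A long trace: 500 clean spans followed by one failing span.
spans = ["tool returned 200 OK"] * 500 + ["Traceback (most recent call last): ..."]
verdicts = evaluate_spans(spans)
flagged = [v for v in verdicts if v.has_error]
```

Because the per-span calls do not depend on one another, they could also be run in parallel, although the source does not say whether the framework does so.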

Superior Performance

When tested on the TRAIL benchmark, which includes both multi-agent and single-agent tasks, the framework outperformed existing methods on every reported metric on both GAIA and SWE-Bench. Relative to the strongest prior baselines, it gained up to 38% on category F1, up to 3.5x on localization accuracy (identifying exactly where an error happened), and up to 12.5x on joint accuracy (correctly identifying both the location and the type of error). Notably, the same frontier model achieved several times higher localization accuracy inside this framework than as a monolithic judge over the full trace. This suggests that the primary bottleneck in agent evaluation is not the capability of the model itself, but the methodology used to guide it.
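To make the difference between the two accuracy metrics concrete, here is one plausible way to compute them over sets of (span, category) pairs. These definitions are assumptions chosen for illustration; the benchmark's exact formulas may differ, and the span IDs and category names below are invented.

```python
Annotation = tuple[str, str]  # (span_id, error_category)


def localization_accuracy(gold: set[Annotation], pred: set[Annotation]) -> float:
    """Fraction of gold error locations that the evaluator also flagged."""
    gold_locations = {span_id for span_id, _ in gold}
    pred_locations = {span_id for span_id, _ in pred}
    if not gold_locations:
        return 0.0
    return len(gold_locations & pred_locations) / len(gold_locations)


def joint_accuracy(gold: set[Annotation], pred: set[Annotation]) -> float:
    """Fraction of gold errors matched on both location and category."""
    if not gold:
        return 0.0
    return len(gold & pred) / len(gold)


# Invented example: two annotated errors, three predictions.
gold = {("s1", "tool_error"), ("s4", "hallucination")}
pred = {("s1", "tool_error"), ("s4", "formatting"), ("s7", "tool_error")}

loc_acc = localization_accuracy(gold, pred)
joint_acc = joint_accuracy(gold, pred)
```

In this example both gold locations are found, so `loc_acc` is 1.0, but only `s1` also has the right category, so `joint_acc` is 0.5. Joint accuracy is the stricter of the two, which is consistent with it showing the largest relative gain in the reported results.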

Key Takeaways

The framework demonstrates that evaluation is most effective when it is granular and structured. By moving away from simple "pass/fail" outcomes and toward a system that provides grounded, span-level evidence, developers can more effectively debug and improve their agents. The framework is designed to be compatible with standard observability tools, making it a practical solution for real-world AI development where understanding the root cause of an agent's failure is critical to building reliable, trustworthy systems.
