Holistic Evaluation and Failure Diagnosis of AI Agents
AI agents are increasingly used for complex, multi-step tasks, but current methods for evaluating them are limited. Most existing benchmarks only report whether an agent succeeded or failed, leaving developers in the dark about why a failure occurred or where in the process it happened. This paper introduces a new framework that breaks down agent execution into individual, manageable parts to provide a precise, diagnostic view of agent performance.
A Two-Pronged Approach
The framework combines two perspectives to create a complete picture of agent behavior. The "bottom-up" approach evaluates individual units of activity—such as a single tool invocation or an LLM call—to pinpoint exactly where an error occurred and categorize its cause. The "top-down" approach looks at the agent’s overall performance, assessing high-level patterns like planning quality, tool efficiency, and whether the agent’s actions actually move it toward its goal. By combining these, the framework can identify both specific technical mistakes and broader, systemic issues that span multiple steps.
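The two perspectives can be illustrated with a minimal sketch. It is not the paper's implementation: the `Span` fields, the category names, and the simple rule-based checks are hypothetical stand-ins for what would in practice be LLM-based judgments. The bottom-up pass emits one finding per faulty span, while the top-down pass summarizes a pattern that only shows up across spans (here, repeated identical tool calls).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of agent activity (hypothetical schema)."""
    span_id: str
    kind: str                    # "tool" or "llm"
    name: str                    # tool or model name
    input: str
    output: str
    error: Optional[str] = None  # error message, if the step raised one


@dataclass
class SpanFinding:
    span_id: str
    category: str                # hypothetical error category label
    rationale: str


def bottom_up(spans: List[Span]) -> List[SpanFinding]:
    """Check each span in isolation and report where and why it failed."""
    findings = []
    for s in spans:
        if s.error is not None:
            category = "tool_error" if s.kind == "tool" else "llm_error"
            findings.append(SpanFinding(s.span_id, category, f"{s.name} raised: {s.error}"))
        elif not s.output.strip():
            findings.append(SpanFinding(s.span_id, "empty_output", f"{s.name} returned nothing"))
    return findings


def top_down(spans: List[Span]) -> Dict[str, int]:
    """Summarize a trace-level pattern: identical tool calls issued more than once."""
    calls = Counter((s.name, s.input) for s in spans if s.kind == "tool")
    return {
        "tool_calls": sum(calls.values()),
        "redundant_tool_calls": sum(n - 1 for n in calls.values()),
    }


def evaluate(spans: List[Span]) -> dict:
    """Combine span-level findings with the trace-level summary."""
    return {"span_findings": bottom_up(spans), "trace_summary": top_down(spans)}
```

Keeping the two passes separate means a span-level finding always carries a concrete `span_id`, while trace-level observations are not forced onto any single step.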
Solving the Long-Trace Problem
A major challenge in evaluating AI agents is that their execution traces can become very long. Traditional "monolithic" judges, which try to analyze the entire trace at once, then lose focus or exceed the model's context window. The framework addresses this by decomposing the trace into independent, per-span assessments. Because each evaluation sees only a small, specific part of the process, the size of each judge input stays bounded, and accuracy does not depend on how long the overall trace is. The method also produces a natural-language rationale for every verdict, making it easier for developers to understand why a failure occurred.
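A rough sketch of per-span decomposition follows. Everything in it is an assumption made for illustration: the `Span` and `Verdict` types, the `MAX_PAYLOAD_CHARS` cap, the prompt wording, and `toy_judge`, which stands in for a real LLM call. The point it shows is that the judge input is built from one span at a time, so its size is bounded by the span rather than by the trace.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_PAYLOAD_CHARS = 2000  # assumed per-span cap, not a value from the paper


@dataclass
class Span:
    span_id: str
    name: str
    payload: str  # serialized input/output of this one step


@dataclass
class Verdict:
    span_id: str
    ok: bool
    rationale: str  # natural-language explanation for the verdict


def build_span_prompt(span: Span, max_chars: int = MAX_PAYLOAD_CHARS) -> str:
    """Build a judge prompt from a single span; length does not grow with the trace."""
    return (
        "Assess this single step.\n"
        f"span_id: {span.span_id}\n"
        f"name: {span.name}\n"
        f"payload:\n{span.payload[:max_chars]}\n"
        "Reply with a verdict and a one-sentence rationale."
    )


def judge_trace(spans: List[Span], judge: Callable[[str], Tuple[bool, str]]) -> List[Verdict]:
    """Judge every span independently and keep the rationale with each verdict."""
    verdicts = []
    for span in spans:
        ok, rationale = judge(build_span_prompt(span))
        verdicts.append(Verdict(span.span_id, ok, rationale))
    return verdicts


def toy_judge(prompt: str) -> Tuple[bool, str]:
    """Placeholder for an LLM judge: flags steps whose payload shows a traceback."""
    if "Traceback" in prompt:
        return False, "The step output contains a Python traceback."
    return True, "No failure signal found in this step."
```

Because the assessments are independent, they could also be run in parallel or cached per span, although whether the framework does so is not stated in the source.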
Superior Performance
When tested on the TRAIL benchmark, which includes both multi-agent and single-agent tasks, the framework outperformed existing methods across all key metrics. It achieved significant gains in localization accuracy (identifying exactly where an error happened) and joint accuracy (correctly identifying both the location and the type of error). Notably, the research found that the same AI model performed significantly better when used within this framework than when acting as a monolithic judge. This suggests that the primary bottleneck in agent evaluation is not the intelligence of the model itself, but the methodology used to guide it.
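To make the two metrics concrete, here is a simplified sketch that scores predicted errors against annotated ones. The set-based, recall-style definitions below are an assumption for illustration and may differ from the benchmark's exact scoring rules; the span IDs and category names are made up.

```python
from typing import Iterable, Tuple

Error = Tuple[str, str]  # (span_id, error_category)


def localization_accuracy(gold: Iterable[Error], pred: Iterable[Error]) -> float:
    """Fraction of annotated error spans that the evaluator also flagged (location only)."""
    gold_spans = {span_id for span_id, _ in gold}
    pred_spans = {span_id for span_id, _ in pred}
    if not gold_spans:
        return 1.0 if not pred_spans else 0.0
    return len(gold_spans & pred_spans) / len(gold_spans)


def joint_accuracy(gold: Iterable[Error], pred: Iterable[Error]) -> float:
    """Fraction of annotated errors matched on both location and category."""
    gold_pairs = set(gold)
    pred_pairs = set(pred)
    if not gold_pairs:
        return 1.0 if not pred_pairs else 0.0
    return len(gold_pairs & pred_pairs) / len(gold_pairs)


# Hypothetical annotations and predictions for one trace.
gold = [("s2", "tool_error"), ("s5", "planning"), ("s7", "formatting"), ("s9", "hallucination")]
pred = [("s2", "tool_error"), ("s5", "formatting"), ("s8", "tool_error")]

print(localization_accuracy(gold, pred))  # 0.5  (s2 and s5 located, out of 4)
print(joint_accuracy(gold, pred))         # 0.25 (only s2 has the right category too)
```

Joint accuracy can never exceed localization accuracy under these definitions, since a joint match requires the location to be right first. Note that this simplified version does not penalize spurious predictions such as `s8`.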
Key Takeaways
The framework demonstrates that evaluation is most effective when it is granular and structured. By moving away from simple "pass/fail" outcomes and toward a system that provides grounded, span-level evidence, developers can more effectively debug and improve their agents. The framework is designed to be compatible with standard observability tools, making it a practical solution for real-world AI development where understanding the root cause of an agent's failure is critical to building reliable, trustworthy systems.