TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

Paper Abstract

Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent trajectories into shared decision landscapes. For each task, TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions, and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across trajectories spanning five benchmark splits, TraceGraph profiles reveal navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same TraceGraph landscape also motivates a trap-aware recovery pipeline for SWE-bench: a runtime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances, with provider-specific active components. Overall, TraceGraph provides a process vocabulary for asking what agent benchmarks test, where models diverge on a shared landscape, and how failure regions can guide downstream improvement.

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

Current methods for evaluating AI agents often reduce each rollout to a pass rate or reward score, which says little about how an agent actually navigates a task. The authors introduce TraceGraph, a graph-based framework that moves beyond these aggregate scores by representing agent behavior as a "shared decision landscape." By pooling the paths taken by multiple models into one graph, TraceGraph lets researchers see where agents make progress, where they run into failure-prone regions, and whether they recover from them.

Mapping Agent Behavior

TraceGraph pools interaction trajectories from multiple models into a single per-task graph over observable action-observation states. The graph is built before model identity is introduced, so every model is later compared on the same shared landscape. The framework then overlays two kinds of outcome-informed regions on this graph: "productive cores," areas where agents successfully progress, and "trap regions," areas associated with failure. Each rollout can then be summarized with three events: Access (reaching a state), Trap exposure (entering a failure-prone area), and Repair (successfully navigating out of a trap).
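The pooling and three-event summary described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the string state keys, the `build_graph` and `summarize` helpers, and the exact event definitions (Access as reaching any core state, Repair as reaching a core state after the first trap state) are all assumptions made for the example.

```python
from collections import defaultdict


def build_graph(rollouts):
    """Pool rollouts into one directed graph; model identity is never used."""
    edges = defaultdict(set)
    for states in rollouts:
        for src, dst in zip(states, states[1:]):
            edges[src].add(dst)
    return dict(edges)


def summarize(states, core, traps):
    """Return assumed Access / Trap exposure / Repair flags for one rollout."""
    access = any(s in core for s in states)
    trap_steps = [i for i, s in enumerate(states) if s in traps]
    exposed = bool(trap_steps)
    # Assumed definition: Repair means a core state is reached after the first trap state.
    repair = exposed and any(s in core for s in states[trap_steps[0] + 1:])
    return {"access": access, "trap_exposure": exposed, "repair": repair}


# Hypothetical state keys standing in for action-observation states.
rollouts = [
    ["open_issue", "read_file", "edit_file", "run_tests"],
    ["open_issue", "grep_wrong_dir", "read_file", "edit_file"],
    ["open_issue", "grep_wrong_dir", "grep_wrong_dir"],
]
graph = build_graph(rollouts)

# Hand-picked regions for illustration; in the paper these are outcome-informed.
core = {"edit_file", "run_tests"}
traps = {"grep_wrong_dir"}
profiles = [summarize(r, core, traps) for r in rollouts]
```

In this toy data the second rollout is exposed to the trap and repairs, while the third is exposed and never reaches the core, a distinction a plain pass rate would not record.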

Revealing Hidden Differences

By applying this framework across five benchmark splits, the researchers discovered that aggregate scores often mask significant differences in agent behavior. For example, the profiles revealed that some benchmark splits primarily reward agents for avoiding traps entirely, while others reward agents for their ability to recover once a trap has been encountered. This process vocabulary provides a clearer way to understand what specific benchmarks are actually testing and how different models diverge in their decision-making processes.
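One way to picture the "avoid versus recover" distinction is to compare success rates across rollouts grouped by their trap events. The sketch below uses made-up records and hypothetical field names (`trap`, `repair`, `success`); it does not reproduce the paper's profiles or data.

```python
def rate(records, keep):
    """Success rate over the records selected by `keep`; None if none match."""
    chosen = [r for r in records if keep(r)]
    if not chosen:
        return None
    return sum(r["success"] for r in chosen) / len(chosen)


def split_profile(records):
    """Group rollouts of one benchmark split by trap events and score each group."""
    return {
        "avoided": rate(records, lambda r: not r["trap"]),
        "repaired": rate(records, lambda r: r["trap"] and r["repair"]),
        "stuck": rate(records, lambda r: r["trap"] and not r["repair"]),
    }


# Toy records for a single hypothetical split.
toy_split = [
    {"trap": False, "repair": False, "success": True},
    {"trap": False, "repair": False, "success": True},
    {"trap": True, "repair": True, "success": True},
    {"trap": True, "repair": True, "success": False},
    {"trap": True, "repair": False, "success": False},
]
profile = split_profile(toy_split)
```

Under these assumptions, a split where "avoided" is far above "repaired" mostly rewards staying out of traps, while a split where the two are close rewards recovery.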

Improving Performance with Trap-Aware Recovery

Beyond diagnosis, the TraceGraph landscape can be used to improve agent performance. The authors built a trap-aware recovery pipeline for SWE-bench, a benchmark for software engineering agents. A runtime detector fires when the agent reaches a state matching a historical "trap region," and lightweight continuation policies are then evaluated from the same prefix. On fired states, the best pooled single-factor policy raises the official resolved rate from 40.4% to 43.5% on the per-provider fired subset (a gain of 3.1 percentage points) and from 41.0% to 44.8% on common-fired instances (3.8 points), with the active components differing by provider. The result suggests that identified failure regions can serve as a practical guide for building agents that recover from errors.
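The detect-then-branch idea can be sketched as below. This is an illustration under stated assumptions, not the paper's pipeline: matching is reduced to set membership, and the `first_fire` and `branch_from_prefix` helpers and both toy policies are hypothetical. The last two lines only recompute the reported gains from the figures quoted above.

```python
def first_fire(states, trap_region):
    """Index of the first state that matches the trap region, or None."""
    for i, state in enumerate(states):
        if state in trap_region:
            return i
    return None


def branch_from_prefix(states, trap_region, policies):
    """Run every continuation policy from the same prefix, ending at the fired state."""
    i = first_fire(states, trap_region)
    if i is None:
        return {}  # detector never fired, so nothing to branch
    prefix = states[: i + 1]
    return {name: prefix + policy(prefix) for name, policy in policies.items()}


# Hypothetical trajectory, trap region, and continuation policies.
trajectory = ["open_issue", "grep_wrong_dir", "grep_wrong_dir"]
trap_region = {"grep_wrong_dir"}
policies = {
    "continue": lambda prefix: ["grep_wrong_dir"],
    "backtrack": lambda prefix: ["read_file", "edit_file"],
}
branches = branch_from_prefix(trajectory, trap_region, policies)

# Gains in percentage points, from the resolved rates reported in the abstract.
gain_per_provider = round(43.5 - 40.4, 1)
gain_common_fired = round(44.8 - 41.0, 1)
```

Branching every policy from an identical prefix keeps the comparison fair, since any difference in outcome is attributable to the continuation rather than to what happened before the detector fired.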