Key Takeaways

  • Diagnosing failures in LLM agents remains largely manual.
  • Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate.
  • This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens.
  • We formalize the problem of corpus-level trace diagnostics.
Paper Abstract

Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading in depth and evidence quality.

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
Diagnosing why AI agents fail is currently a manual, inefficient process. Developers typically inspect a handful of individual execution traces, form guesses, and repeat. This approach does not scale: modern agent traces can contain tens of thousands of tokens, and many performance issues only become visible across large groups of traces rather than in single runs. One example is the "silent failure," where an agent produces a wrong answer without triggering an error. This paper introduces the Insights Generator (IG), a multi-agent system designed to automatically analyze large collections of agent traces to identify, validate, and report on systematic behavioral patterns.

How the Insights Generator Works

The system runs a structured, iterative loop that separates the discovery of patterns from their validation. Instead of feeding raw, massive trace files directly into an AI model, IG uses a stateful Python data layer. This layer lets the system perform data science operations, such as calculating statistics or comparing cohorts of traces, without overwhelming the model's context window.
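
To make the idea of a stateful data layer concrete, here is a minimal sketch, not IG's actual implementation. The names `Trace`, `TraceStore`, `define_cohort`, `summarize`, and `compare` are all hypothetical, and the four traces are made-up toy data. The point it illustrates is that the full traces stay inside the Python object, and only small numeric summaries would ever be handed to a model.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Trace:
    # Hypothetical trace record; real traces would carry full transcripts.
    trace_id: str
    succeeded: bool
    num_tokens: int
    tool_calls: list = field(default_factory=list)


class TraceStore:
    """Hypothetical data layer: keeps the corpus in memory, returns compact summaries."""

    def __init__(self, traces):
        self._traces = list(traces)
        self._cohorts = {}  # named cohorts persist between calls (the stateful part)

    def define_cohort(self, name, predicate):
        """Store the traces matching `predicate` under `name`; return the cohort size."""
        self._cohorts[name] = [t for t in self._traces if predicate(t)]
        return len(self._cohorts[name])

    def summarize(self, name):
        """Return a few aggregate statistics instead of raw trace text."""
        cohort = self._cohorts[name]
        if not cohort:
            raise ValueError(f"cohort {name!r} is empty")
        return {
            "cohort": name,
            "n": len(cohort),
            "success_rate": mean(1.0 if t.succeeded else 0.0 for t in cohort),
            "mean_tokens": mean(t.num_tokens for t in cohort),
        }

    def compare(self, a, b):
        """Compare two previously defined cohorts by success rate."""
        sa, sb = self.summarize(a), self.summarize(b)
        return {"a": sa, "b": sb,
                "success_rate_diff": sa["success_rate"] - sb["success_rate"]}


# Toy corpus, invented for illustration only.
store = TraceStore([
    Trace("t1", True, 12000, ["search", "read"]),
    Trace("t2", False, 48000, ["search", "search", "search"]),
    Trace("t3", False, 39000, ["search", "search"]),
    Trace("t4", True, 9000, ["read"]),
])
store.define_cohort("repeat_search", lambda t: t.tool_calls.count("search") >= 2)
store.define_cohort("other", lambda t: t.tool_calls.count("search") < 2)
comparison = store.compare("repeat_search", "other")
print(comparison)
```

In this sketch the comparison result is a dictionary of a few numbers, so its size does not grow with the length of the underlying traces.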
The system relies on three distinct roles:

  • The Orchestrator: Acts as the central manager, breaking down high-level diagnostic questions into smaller tasks and synthesizing the final report.

  • The Scout Agent: Explores a sample of traces to propose diverse, testable hypotheses about agent behavior.

  • The Investigator Agent: Takes those hypotheses and performs rigorous, corpus-scale validation to confirm or refute them using quantitative evidence.
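
The division of labor among the three roles above can be sketched as follows. This is an illustrative toy under stated assumptions, not the system described in the paper: in IG the Scout and Investigator are LLM-driven agents, whereas here `scout` returns hard-coded hypotheses and `investigate` uses a simple failure-rate gap with a hypothetical threshold `min_gap`. All names and the five-trace corpus are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    statement: str
    predicate: Callable[[dict], bool]  # marks traces that exhibit the pattern


@dataclass
class Finding:
    statement: str
    supported: bool
    failure_rate_with: float
    failure_rate_without: float
    evidence_ids: list


def scout(sample):
    """Stand-in for the Scout: propose testable hypotheses from a small sample.
    Hard-coded here; an LLM would generate these from the sampled traces."""
    return [
        Hypothesis("Traces with 3+ retries fail more often",
                   lambda t: t["retries"] >= 3),
        Hypothesis("Traces that call the 'search' tool fail more often",
                   lambda t: "search" in t["tools"]),
    ]


def failure_rate(traces):
    return sum(not t["ok"] for t in traces) / len(traces) if traces else 0.0


def investigate(hypothesis, corpus, min_gap=0.2):
    """Stand-in for the Investigator: test one hypothesis on the whole corpus."""
    with_p = [t for t in corpus if hypothesis.predicate(t)]
    without = [t for t in corpus if not hypothesis.predicate(t)]
    fr_with, fr_without = failure_rate(with_p), failure_rate(without)
    supported = bool(with_p) and bool(without) and (fr_with - fr_without) >= min_gap
    evidence = [t["id"] for t in with_p if not t["ok"]]  # link back to traces
    return Finding(hypothesis.statement, supported, fr_with, fr_without, evidence)


def orchestrate(question, corpus, sample_size=3):
    """Stand-in for the Orchestrator: run scout, then investigator, then collect."""
    findings = [investigate(h, corpus) for h in scout(corpus[:sample_size])]
    return {
        "question": question,
        "findings": [f for f in findings if f.supported],
        "rejected": [f.statement for f in findings if not f.supported],
    }


# Toy corpus, invented for illustration only.
corpus = [
    {"id": "t1", "ok": True, "retries": 0, "tools": ["search"]},
    {"id": "t2", "ok": False, "retries": 4, "tools": ["search"]},
    {"id": "t3", "ok": False, "retries": 3, "tools": ["read"]},
    {"id": "t4", "ok": True, "retries": 1, "tools": ["search", "read"]},
    {"id": "t5", "ok": True, "retries": 0, "tools": ["read"]},
]
report = orchestrate("Why do runs fail?", corpus)
for f in report["findings"]:
    print(f.statement, f.evidence_ids)
```

The design point the sketch tries to capture is that a hypothesis proposed from a sample is only reported if it survives a check against the full corpus, and every reported finding carries the IDs of the traces that support it.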

Evaluating Diagnostic Quality

To measure success, the researchers developed a framework that evaluates both the quality of the generated reports and the real-world impact of the insights. They tested the system using both automated LLM judges and human experts. The evaluation focused on whether the reports were accurate, well-supported by evidence, and specific enough to be actionable. Across multiple benchmarks, the IG system consistently outperformed other approaches in "pairwise win rates," meaning human and AI judges preferred its reports over those generated by alternative multi-agent or single-agent methods.

Impact on Agent Performance

The researchers found that the quality of these diagnostic reports translates directly into better agent performance. When human experts used IG reports to guide improvements to their agent "scaffolds" (the underlying code and logic), they achieved a 30.4 percentage-point improvement over the unmodified baseline scaffold. This was nearly double the gain achieved by the next-best analysis system. The results suggest that by moving from manual, per-trace debugging to systematic, corpus-level analysis, developers can identify and fix issues that per-trace inspection misses, leading to more reliable and effective AI agents.
