Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
This research investigates whether Deep Research Agents (DRAs), AI systems that plan, search the web, and write detailed reports, can improve their work when given feedback. Most current benchmarks evaluate only a single initial draft; this paper instead examines how agents perform when asked to revise their reports over multiple turns. The authors test two feedback methods to see whether either helps agents overcome common research pitfalls and produce more accurate, comprehensive reports.

Evaluating Feedback Methods

The study compares two ways of guiding an AI agent. The first is "self-reflection," where the agent is asked to review and improve its own report without any outside help. The second is "process-level feedback," generated by a new method called Research Gap Inference (RGI). RGI checks the agent's previous report against a rubric to identify specific gaps in its research strategy, such as missing subtopics or poor source selection, and provides targeted guidance on how to fix those underlying issues.
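
The two conditions can be pictured as the same revision loop with a different feedback source plugged in. The sketch below is only an illustration under assumptions: the `Agent` and `FeedbackFn` signatures, the prompt wording, and the function names are hypothetical and are not taken from the paper.

```python
from typing import Callable, List

# Hypothetical signatures (not from the paper):
# an agent maps (task, previous report, feedback) to a new report,
# and a feedback function maps the previous report to a guidance message.
Agent = Callable[[str, str, str], str]
FeedbackFn = Callable[[str], str]


def self_reflection(_report: str) -> str:
    """Generic prompt that carries no external signal about what is missing."""
    return "Review your report and improve it."


def run_revision_turns(
    agent: Agent, task: str, feedback_fn: FeedbackFn, turns: int
) -> List[str]:
    """Return the report from every turn, starting with the initial draft."""
    reports = [agent(task, "", "")]  # turn 0: no previous report, no feedback
    for _ in range(turns):
        previous = reports[-1]
        reports.append(agent(task, previous, feedback_fn(previous)))
    return reports
```

Swapping `self_reflection` for a function that returns rubric-derived guidance would give the process-level condition while leaving the rest of the loop unchanged.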

How Research Gap Inference Works

To generate process-level feedback, RGI looks at the patterns of criteria the agent met and missed in its previous draft. Instead of just pointing out a single error, it clusters these successes and failures to understand the agent's broader research process. It then provides the agent with a concise message focusing on two or three key research themes. This forces the agent to independently find new evidence and adjust its search strategy in the next round, rather than just correcting a specific sentence.
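
A minimal sketch of the idea described above, assuming each rubric criterion already carries a theme label. The summary does not say how RGI actually clusters criteria or words its messages, so the `theme` field, the ranking rule, and the message text here are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CriterionResult:
    """One rubric criterion and whether the previous report satisfied it."""
    criterion: str
    theme: str  # hypothetical label; how themes are derived is an assumption
    met: bool


def infer_research_gaps(
    results: Iterable[CriterionResult], max_themes: int = 3
) -> List[str]:
    """Group rubric outcomes by theme and return the themes with the most misses."""
    missed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.theme] += 1
        if not r.met:
            missed[r.theme] += 1
    # Rank by miss count, then miss rate, then name so the order is deterministic.
    ranked = sorted(
        missed, key=lambda t: (-missed[t], -missed[t] / total[t], t)
    )
    return ranked[:max_themes]


def build_feedback(themes: List[str]) -> str:
    """Name the under-covered themes without handing over specific corrections."""
    if not themes:
        return "No systematic research gaps were identified."
    lines = "\n".join(f"- {t}" for t in themes)
    return "Your previous research under-covered these themes:\n" + lines
```

Limiting the output to theme names, not the individual failed criteria, mirrors the point made above: the agent still has to find the missing evidence itself.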

Key Findings on Agent Performance

The results show a clear divide between the two feedback approaches. Under self-reflection, agents struggle to identify their own mistakes, often losing as many good points as they gain, leading to almost no net improvement. In contrast, process-level feedback leads to significant gains in the first round of revision, with normalized scores rising by 8 to 15 points. However, these improvements are difficult to sustain. In subsequent turns, agents often struggle to address new gaps without accidentally "forgetting" or regressing on information they had already successfully included in previous versions.
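
The gain-versus-regression pattern described above can be made concrete by comparing the sets of rubric criteria satisfied in consecutive turns. This is a sketch under assumptions: the function names are hypothetical, and the normalization shown (percentage of criteria met) is an assumed definition, since the summary does not state how the paper normalizes scores.

```python
from typing import Dict, Set


def turn_delta(prev_met: Set[str], curr_met: Set[str]) -> Dict[str, int]:
    """Count criteria newly satisfied and criteria lost between two turns."""
    gained = curr_met - prev_met  # satisfied now, missed before
    lost = prev_met - curr_met    # satisfied before, missed now (regression)
    return {
        "gained": len(gained),
        "lost": len(lost),
        "net": len(gained) - len(lost),
    }


def normalized_score(met: Set[str], total_criteria: int) -> float:
    """Assumed normalization: percentage of rubric criteria satisfied."""
    if total_criteria <= 0:
        raise ValueError("total_criteria must be positive")
    return 100.0 * len(met) / total_criteria
```

For example, a revision that adds one criterion while dropping another yields a net change of zero even though the report itself changed, which is the pattern reported for self-reflection.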

The Limits of Current Architectures

Even with the help of targeted, process-level guidance, the study concludes that reliable, multi-turn improvement remains a significant challenge for current DRA architectures. While the agents are capable of incorporating new information when specifically directed, the tendency to regress on previously satisfied criteria prevents the quality of the reports from compounding over time. This suggests that while external feedback is a powerful tool, the underlying systems still lack the consistency required to refine their research iteratively over many turns.
