Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Key Takeaways

  • Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback?
  • To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps.
  • Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate.
  • Our code and results are publicly available.

Paper Abstract

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 8-15 points and yielding a roughly 35-40% incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available.

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
This research investigates whether Deep Research Agents (DRAs)—AI systems designed to plan, search the web, and write detailed reports—can actually improve their work when given feedback. While most current benchmarks only evaluate a single, initial draft, this paper explores how these agents perform when they are asked to revise their reports over multiple turns. The authors test two different feedback methods to see if they can help agents overcome common research pitfalls and produce more accurate, comprehensive documents.

Evaluating Feedback Methods

The study compares two ways of guiding an AI agent. The first is "self-reflection," where the agent is asked to review and improve its own report without any outside help. The second is "process-level feedback," which uses a new method called Research Gap Inference (RGI). RGI analyzes the agent's previous report against a rubric to identify specific gaps in its research strategy—such as missing subtopics or poor source selection—and provides targeted guidance on how to fix those underlying issues.
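The two settings differ only in what the agent receives before each revision, which a short loop can make concrete. The following is a minimal sketch, not the authors' harness: `run_multi_turn`, the `agent`/`grade`/`make_feedback` callables, and the toy stand-ins are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Optional

Scores = Dict[str, bool]  # rubric criterion -> satisfied?

def run_multi_turn(
    agent: Callable[[str, Optional[str], Optional[str]], str],
    grade: Callable[[str], Scores],
    make_feedback: Optional[Callable[[Scores], str]],
    task: str,
    revisions: int,
) -> List[Scores]:
    """Grade an initial report, then grade each revision (hypothetical protocol)."""
    report = agent(task, None, None)  # turn 0: initial report
    history = [grade(report)]
    for _ in range(revisions):
        # Self-reflection: make_feedback is None, so the agent gets no external signal.
        # Process-level: guidance is derived from the previous grading.
        feedback = make_feedback(history[-1]) if make_feedback else None
        report = agent(task, report, feedback)
        history.append(grade(report))
    return history

# Toy stand-ins so the sketch runs; their behaviour is hard-coded and is
# not evidence for or against either feedback setting.
def toy_agent(task: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None:
        return "overview"
    return previous + (" sources" if feedback else "")

def toy_grade(report: str) -> Scores:
    return {"has overview": "overview" in report,
            "discusses sources": "sources" in report}

def toy_feedback(scores: Scores) -> str:
    return "Revisit how sources are selected."

self_reflection = run_multi_turn(toy_agent, toy_grade, None, "topic", 2)
process_level = run_multi_turn(toy_agent, toy_grade, toy_feedback, "topic", 2)
```

Keeping the grader outside the agent means both settings are scored identically, so any difference in the returned histories comes from the feedback alone.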

How Research Gap Inference Works

To generate process-level feedback, RGI looks at the patterns of criteria the agent met and missed in its previous draft. Instead of just pointing out a single error, it clusters these successes and failures to understand the agent's broader research process. It then provides the agent with a concise message focusing on two or three key research themes. This forces the agent to independently find new evidence and adjust its search strategy in the next round, rather than just correcting a specific sentence.
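As a rough illustration of the idea, the sketch below groups rubric outcomes by theme and names only the most-missed themes in the feedback. It is a toy stand-in, not the RGI method itself: the source does not describe how criteria are clustered, so the per-criterion `theme` label, the miss-rate ranking, and the names `Criterion`, `infer_gaps`, and `feedback_message` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple

class Criterion(NamedTuple):
    text: str
    theme: str       # hypothetical label; how themes are obtained is an assumption
    satisfied: bool

def infer_gaps(criteria: List[Criterion], max_themes: int = 3) -> List[str]:
    """Return up to max_themes themes, ranked by the share of missed criteria."""
    by_theme: Dict[str, List[bool]] = defaultdict(list)
    for c in criteria:
        by_theme[c.theme].append(c.satisfied)
    miss_rate = {t: 1 - sum(v) / len(v) for t, v in by_theme.items()}
    # Keep themes with at least one miss; most-missed first, ties broken by name.
    ranked = sorted((t for t, r in miss_rate.items() if r > 0),
                    key=lambda t: (-miss_rate[t], t))
    return ranked[:max_themes]

def feedback_message(themes: List[str]) -> str:
    """Name themes only, never individual criteria, so the agent must find the fixes itself."""
    if not themes:
        return "No systematic research gaps were inferred."
    return "Strengthen your research process in these areas: " + "; ".join(themes) + "."

criteria = [
    Criterion("cites primary sources", "source selection", False),
    Criterion("uses recent data", "source selection", False),
    Criterion("covers regulatory context", "topic coverage", False),
    Criterion("covers market size", "topic coverage", True),
    Criterion("states a clear conclusion", "synthesis", True),
]
gaps = infer_gaps(criteria)
message = feedback_message(gaps)
```

Here `gaps` is `["source selection", "topic coverage"]`: the fully satisfied "synthesis" theme is left out, and the message never quotes a specific rubric item.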

Key Findings on Agent Performance

The results show a clear divide between the two feedback approaches. Under self-reflection, agents incorporate new rubric criteria and regress on previously satisfied ones at nearly equal rates, so the net improvement is negligible. In contrast, a single round of process-level feedback raises normalized scores by roughly 8 to 15 points, with an incorporation rate of roughly 35-40%. However, these gains do not compound. In subsequent turns, agents rewrite the full report to address the remaining gaps and, in doing so, regress on up to 24% of the criteria they had already satisfied.
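The per-turn quantities behind these findings can be computed from two consecutive rubric gradings. The sketch below assumes definitions the summary does not spell out: incorporation rate is taken as the share of previously unsatisfied criteria that become satisfied, regression rate as the share of previously satisfied criteria that are lost, and the normalized score as the unweighted percentage of satisfied criteria. The function name `turn_metrics` is hypothetical.

```python
from typing import Dict

def turn_metrics(prev: Dict[str, bool], curr: Dict[str, bool]) -> Dict[str, float]:
    """Compare rubric outcomes of two consecutive report versions (assumed definitions)."""
    if prev.keys() != curr.keys():
        raise ValueError("both turns must be graded against the same rubric")
    if not prev:
        raise ValueError("rubric must contain at least one criterion")

    unmet_before = [c for c, ok in prev.items() if not ok]
    met_before = [c for c, ok in prev.items() if ok]
    incorporated = sum(curr[c] for c in unmet_before)   # newly satisfied
    regressed = sum(not curr[c] for c in met_before)    # previously satisfied, now lost

    def score(result: Dict[str, bool]) -> float:
        # Assumption: unweighted percentage of satisfied criteria.
        return 100.0 * sum(result.values()) / len(result)

    return {
        "incorporation_rate": incorporated / len(unmet_before) if unmet_before else 0.0,
        "regression_rate": regressed / len(met_before) if met_before else 0.0,
        "score_delta": score(curr) - score(prev),
    }

# One criterion gained, one lost: the rates match and the score does not move.
before = {"c1": True, "c2": True, "c3": False, "c4": False}
after = {"c1": True, "c2": False, "c3": True, "c4": False}
metrics = turn_metrics(before, after)
```

In this made-up example both rates are 0.5 and `score_delta` is 0.0, which mirrors the self-reflection pattern of gains cancelled by losses.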

The Limits of Current Architectures

Even with the help of targeted, process-level guidance, the study concludes that reliable, multi-turn improvement remains a significant challenge for current DRA architectures. While the agents are capable of incorporating new information when specifically directed, the tendency to regress on previously satisfied criteria prevents the quality of the reports from compounding over time. This suggests that while external feedback is a powerful tool, the underlying systems still lack the consistency required to refine their research iteratively over many turns.
