DRFLOW: A Deep Research Benchmark for Personalized...

Key Takeaways

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing works mainly focus on generating reports and summaries.
In contrast, many enterprise tasks require an agent to identify a concrete workflow, that is, a sequence of action steps.
For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as "How do I request new headcount given a fixed budget?"
Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources.

Paper Abstract

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing work mainly focuses on generating reports and summaries. In contrast, many enterprise tasks require an agent to identify a concrete workflow, that is, a sequence of action steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as "How do I request new headcount given a fixed budget?" Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent for predicting personalized workflows. We show that although DRFA improves over strong baseline agents (up to 10.02% average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.

Deep research (DR) systems are increasingly common, but they are typically designed to generate summaries or reports. The paper "DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction" argues that these systems often fail to address practical enterprise needs, which require agents to identify concrete, step-by-step workflows to solve specific problems. To bridge this gap, the authors introduce a new benchmark designed to evaluate how well AI agents can synthesize information from diverse sources to create personalized action plans.

A New Benchmark for Enterprise Tasks

The researchers developed DRFLOW to shift the focus of deep research from generating reports and summaries to predicting actionable workflows. For example, rather than just summarizing a company's budgeting policy, an agent should be able to guide a user through the specific steps required to request new headcount under a fixed budget. Each task requires the agent to find relevant evidence in scattered sources and then use it to predict the correct sequence of action steps. The benchmark consists of 100 tasks across five domains, supported by 1,246 reference workflow steps that are grounded in more than 3,900 sources.
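To make the shape of a task concrete, here is a minimal Python sketch of what a single benchmark record might look like. The class name `WorkflowTask` and all of its fields are hypothetical and chosen for illustration only; the article does not describe the benchmark's actual data format. The only figures taken from the article are the totals used to compute the average number of reference steps per task.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowTask:
    """Hypothetical record for one task; field names are assumptions."""
    domain: str                      # one of the benchmark's five domains
    question: str                    # the user's request
    sources: list[str] = field(default_factory=list)          # evidence documents
    reference_steps: list[str] = field(default_factory=list)  # ordered action steps


def mean_steps_per_task(total_steps: int, total_tasks: int) -> float:
    """Average number of reference steps per task."""
    return total_steps / total_tasks


# Illustrative instance built from the article's example question.
example = WorkflowTask(
    domain="finance",  # assumed domain label, not taken from the paper
    question="How do I request new headcount given a fixed budget?",
)

# Totals reported in the article: 1,246 reference steps over 100 tasks.
print(mean_steps_per_task(1246, 100))  # 12.46
```

With the reported totals, a reference workflow averages about 12.5 steps, which gives a sense of how long the sequences are that an agent has to reconstruct.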

Evaluating Workflow Performance

To measure how well an agent performs, the authors define seven diagnostic metrics. Together, these metrics cover five aspects of a predicted workflow:

Factual Grounding: Ensuring the steps are supported by the provided sources.
Step Recovery: Checking if the agent identifies the correct necessary actions.
Structural Ordering: Verifying that the steps follow a logical sequence.
Condition Resolution: Assessing how well the agent handles specific user constraints.
Personalization: Determining if the workflow is tailored to the specific user's request.
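As a rough illustration of two of these aspects, step recovery and structural ordering, the sketch below scores a predicted workflow against a reference one. It is not the paper's metric definition: it assumes steps can be compared by exact string match and that each step appears once in the reference. The function names and the example step labels are hypothetical.

```python
from typing import Sequence


def step_recovery_f1(predicted: Sequence[str], reference: Sequence[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 over exactly matching steps."""
    pred_set, ref_set = set(predicted), set(reference)
    matched = pred_set & ref_set
    if not matched:
        return 0.0, 0.0, 0.0
    precision = len(matched) / len(pred_set)
    recall = len(matched) / len(ref_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def ordering_score(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of matched step pairs whose relative order agrees with the reference."""
    ref_pos = {step: i for i, step in enumerate(reference)}
    # Keep matched steps in predicted order, dropping repeats.
    matched: list[str] = []
    for step in predicted:
        if step in ref_pos and step not in matched:
            matched.append(step)
    pairs = concordant = 0
    for i in range(len(matched)):
        for j in range(i + 1, len(matched)):
            pairs += 1
            if ref_pos[matched[i]] < ref_pos[matched[j]]:
                concordant += 1
    # With fewer than two matched steps there is no order to violate.
    return concordant / pairs if pairs else 1.0


# Hypothetical step labels for the headcount example.
reference = ["check_budget", "draft_request", "manager_approval", "submit_to_hr"]
predicted = ["draft_request", "check_budget", "submit_to_hr", "notify_team"]

print(step_recovery_f1(predicted, reference))  # (0.75, 0.75, 0.75)
print(ordering_score(predicted, reference))
```

In this example the agent recovers three of the four reference steps and adds one unsupported step, so precision, recall, and F1 are all 0.75. Two of the three matched pairs are in the right relative order, giving an ordering score of 2/3. A real evaluation would need a softer notion of step matching than exact string equality.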

Performance of the DRFLOW-Agent

The authors also introduce DRFLOW-Agent (DRFA), a workflow-oriented reference agent for predicting personalized workflows. Compared with strong baseline agents, DRFA improves the average F1 score by up to 10.02%.

The Remaining Challenge

Despite the improvements offered by the DRFLOW-Agent, the authors emphasize that there is still significant room for growth. The results indicate that generating complete, accurate, and personalized workflows remains a difficult frontier for current deep research systems. The benchmark serves as a tool to highlight these ongoing challenges and encourage further development in agent-based workflow automation.
