Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Key Takeaways

  • We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference.
  • At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation.
  • To make this guidance reliable, we train a dedicated rubric generator with GRPO.
  • On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models.
Paper Abstract

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available.

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Search-intensive AI agents built in the ReAct style must decide, step by step, how to search for information and when to stop. They frequently fall into traps such as repeating the same search queries, stopping too early, or ignoring diverse perspectives. Researchers have previously used rubrics (checklists of quality criteria) to evaluate final reports, but such rubrics are often too coarse to help an agent during the search itself. Co-ReAct instead uses rubrics as active, step-level collaborators that give the agent specific, actionable requirements at every decision point.

Turning Rubrics into Actionable Guidance

Instead of using rubrics only to grade a finished report, Co-ReAct treats them as a prescriptive tool. Before the agent takes a step, a specialized rubric generator creates a set of criteria tailored to the current state of the search. This rubric specifies what the agent should target next in evidence seeking, search, reasoning, or self-evaluation. Injecting these requirements directly into the agent's context is intended to keep each action purposeful and aligned with what the research task needs at that moment.

Training for Real-World Reliability

A major challenge with AI-generated guidance is ensuring that it is actually helpful rather than merely plausible-sounding. To address this, the authors trained their rubric generator with Group Relative Policy Optimization (GRPO). Instead of using pairwise or binary preferences, they took a list-wise approach: multiple candidate actions were given to expert judges, whose assessments were combined into a consensus ranking. The rubric generator is then rewarded by how closely its criteria align with this consensus ranking, measured by Spearman rank correlation. This objective encourages rubrics that are discriminative, meaning they help the agent distinguish effective next steps from ineffective ones.
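A list-wise rank-correlation reward of this kind can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the rubric has already been used to assign each candidate action a numeric score, and that the judges' consensus is also available as higher-is-better scores. The names `average_ranks` and `spearman_reward` are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_reward(rubric_scores: Sequence[float],
                    consensus_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the two rank lists.

    Both inputs are assumed to be higher-is-better scores over the same
    candidate actions, in the same order.
    """
    rx = average_ranks(rubric_scores)
    ry = average_ranks(consensus_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        # A rubric that cannot separate the candidates earns no reward.
        return 0.0
    return cov / (var_x * var_y) ** 0.5
```

Under these assumptions, a rubric that orders the candidates exactly as the judges do receives a reward of 1, one that reverses the order receives -1, and one that scores every candidate identically receives 0, which is what makes the signal favor discriminative rubrics.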

A Three-Step Loop for Better Research

Co-ReAct improves the standard search process by adding an "inject–verify–retry" loop to the agent's workflow:

  1. Inject: The generator provides a rubric specifying what the next action should achieve.
  2. Verify: Before the agent executes its chosen action, an independent verifier checks the action against the rubric's criteria.
  3. Retry: If the action fails to meet the criteria, the agent is given one chance to re-plan its move based on specific feedback about what was missing.

This structure allows the agent to self-correct in real time. Furthermore, because the rubric generator is a modular component, it can be used as a plug-in to improve other existing search methods without changing their underlying decision mechanisms.
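The control flow of one inject-verify-retry step can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the paper's code: the three callables stand in for the rubric generator, the agent, and the verifier, and `Verdict`, `co_react_step`, and all parameter names are hypothetical. Whether the re-planned action is verified again is not stated in the summary; this sketch executes it without a second check.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    passed: bool
    feedback: str  # what the action was missing, if it failed


def co_react_step(
    state: str,
    generate_rubric: Callable[[str], List[str]],
    propose_action: Callable[[str, List[str], Optional[str]], str],
    verify: Callable[[str, List[str]], Verdict],
    max_retries: int = 1,
) -> str:
    """Run one decision step and return the action to execute."""
    # Inject: the rubric is produced for the current state and given to the agent.
    rubric = generate_rubric(state)
    action = propose_action(state, rubric, None)

    for _ in range(max_retries):
        # Verify: check the proposed action against the rubric before executing it.
        verdict = verify(action, rubric)
        if verdict.passed:
            break
        # Retry: re-plan once, using the verifier's feedback.
        action = propose_action(state, rubric, verdict.feedback)
    return action
```

Passing the generator, agent, and verifier as plain callables mirrors the modularity described above: the same loop can wrap a different agent or search method without touching its internals.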

Consistent Performance Gains

The researchers tested Co-ReAct on two benchmarks, DeepResearchBench and SQA-CS-V2, using search agents built on both 8B/14B open-source and frontier closed-source base models. Co-ReAct consistently improved over standard ReAct agents and representative test-time compute baselines. These results suggest that a clear, step-level signal helps agents produce more comprehensive and accurate research output, and that even frontier models can benefit from an external, expert-aligned collaborator guiding their decisions.
