AutoResearchBench is a new benchmark designed to test how well AI agents can perform autonomous scientific research. While AI agents have become proficient at general web browsing, scientific literature discovery requires a much deeper level of technical reasoning and precision. This benchmark evaluates whether an AI can navigate a massive, up-to-date scientific corpus to find specific papers or compile comprehensive lists of research based on complex, multi-part constraints.
Two Ways to Research
The benchmark splits scientific discovery into two distinct tasks that mirror real-world research workflows:
Deep Research: This task tests an agent's ability to track down a single, specific paper. The agent must navigate through technical details, citations, and appendices to verify if a paper meets a set of highly specific, often hidden, criteria. In some cases, the agent must correctly conclude that no such paper exists.
Wide Research: This task focuses on comprehensive coverage. Instead of finding one target, the agent must identify every paper in the database that satisfies a given scientific condition. This requires the agent to balance broad exploration with strict filtering to ensure the final list is both complete and accurate.
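The explore-then-filter loop described above can be sketched in a few lines. This is a hypothetical illustration, not code from the benchmark: the toy corpus, the keyword-overlap `retrieve` step, and the strict `verify` check are all assumptions standing in for whatever retrieval and full-text verification a real agent would use.

```python
# Toy corpus of paper records (illustrative only).
corpus = [
    {"id": "p1", "title": "Sparse attention for long documents", "keywords": {"attention", "sparse"}},
    {"id": "p2", "title": "A survey of graph neural networks", "keywords": {"graph", "survey"}},
    {"id": "p3", "title": "Sparse mixture-of-experts transformers", "keywords": {"sparse", "moe"}},
]

def retrieve(corpus, query_terms):
    """Broad exploration: keep any paper sharing at least one term with the query."""
    return [p for p in corpus if p["keywords"] & query_terms]

def verify(paper, required_terms):
    """Strict filtering: keep a paper only if every required term is satisfied."""
    return required_terms <= paper["keywords"]

# Cast a wide net, then prune it down to papers that truly qualify.
candidates = retrieve(corpus, {"sparse", "graph"})
answers = [p["id"] for p in candidates if verify(p, {"sparse"})]
print(answers)  # ['p1', 'p3']
```

The tension the task measures lives in these two functions: a `retrieve` that is too narrow hurts completeness, while a `verify` that is too loose hurts accuracy.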
Why This is Difficult
Unlike standard web-browsing benchmarks, which often rely on common-sense information or simple keyword matching, AutoResearchBench is designed to be "extraordinarily challenging." The tasks require the agent to understand complex scientific concepts and extract evidence from full-text documents, including tables, figures, and proof details. Because the number of qualifying papers is often unknown, the agent must demonstrate deliberate reasoning, deciding when to continue searching and when to stop.
Current Performance Gaps
The researchers tested several state-of-the-art AI models and end-to-end research systems on the benchmark. Despite these models performing well on general web-browsing tasks, they struggled significantly with scientific literature discovery.
The top-performing models achieved less than 10% accuracy on Deep Research and scored below 10% on the Intersection over Union (IoU) metric for Wide Research; many other strong models fell below 5%. These results suggest that scientific literature discovery is a distinct and much harder frontier for AI than general web searching. The primary bottlenecks identified include weak scientific reasoning, difficulty managing long, multi-part queries, and an inability to effectively use full-text information to verify findings.
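For readers unfamiliar with the Wide Research metric, IoU compares the agent's predicted set of papers against the gold set: the size of their intersection divided by the size of their union. A minimal sketch (the function name and paper IDs are illustrative):

```python
def iou(predicted, gold):
    """Intersection over Union between predicted and gold sets of paper IDs."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty: perfect agreement by convention
    return len(predicted & gold) / len(predicted | gold)

# Two shared papers out of four distinct papers overall -> 0.5
print(iou({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))  # 0.5
```

Because IoU penalizes both missing papers and spurious ones, an agent cannot game it by over-retrieving or by returning only a few safe answers.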