AutoResearchBench is a new benchmark designed to test how well AI agents can perform autonomous scientific research. While AI agents have become proficient at general web browsing, scientific literature discovery requires a much deeper level of technical reasoning and precision. This benchmark evaluates whether an AI can navigate a massive, up-to-date scientific corpus to find specific papers or compile comprehensive lists of research based on complex, multi-part constraints.
Two Ways to Research
The benchmark splits scientific discovery into two distinct tasks that mirror real-world research workflows:
Deep Research: This task tests an agent's ability to track down a single, specific paper. The agent must navigate through technical details, citations, and appendices to verify if a paper meets a set of highly specific, often hidden, criteria. In some cases, the agent must correctly conclude that no such paper exists.
Wide Research: This task focuses on comprehensive coverage. Instead of finding one target, the agent must identify every paper in the database that satisfies a given scientific condition. This requires the agent to balance broad exploration with strict filtering to ensure the final list is both complete and accurate.
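The explore-then-filter loop described above can be sketched in a few lines. This is a hypothetical illustration, not code from the benchmark: the toy corpus, the keyword-overlap `retrieve` step, and the strict `verify` check are all assumptions standing in for whatever retrieval and full-text verification a real agent would use.

```python
# Toy corpus of paper records (illustrative only).
corpus = [
    {"id": "p1", "title": "Sparse attention for long documents", "keywords": {"attention", "sparse"}},
    {"id": "p2", "title": "A survey of graph neural networks", "keywords": {"graph", "survey"}},
    {"id": "p3", "title": "Sparse mixture-of-experts transformers", "keywords": {"sparse", "moe"}},
]

def retrieve(corpus, query_terms):
    """Broad exploration: keep any paper sharing at least one term with the query."""
    return [p for p in corpus if p["keywords"] & query_terms]

def verify(paper, required_terms):
    """Strict filtering: keep a paper only if every required term is satisfied."""
    return required_terms <= paper["keywords"]

# Cast a wide net, then prune it down to papers that truly qualify.
candidates = retrieve(corpus, {"sparse", "graph"})
answers = [p["id"] for p in candidates if verify(p, {"sparse"})]
print(answers)  # ['p1', 'p3']
```

The tension the task measures lives in these two functions: a `retrieve` that is too narrow hurts completeness, while a `verify` that is too loose hurts accuracy.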
Why This is Difficult
Unlike standard web-browsing benchmarks, which often rely on common-sense information or simple keyword matching, AutoResearchBench is designed to be "extraordinarily challenging." The tasks require the agent to understand complex scientific concepts and extract evidence from full-text documents, including tables, figures, and proof details. Because the number of qualifying papers is often unknown, the agent must demonstrate deliberate reasoning, deciding when to continue searching and when to stop.
Current Performance Gaps
The researchers tested several state-of-the-art AI models and end-to-end research systems on the benchmark. Despite these models performing well on general web-browsing tasks, they struggled significantly with scientific literature discovery.
The top-performing models achieved less than 10% accuracy on Deep Research and scored below 10% on the Intersection over Union (IoU) metric for Wide Research; many other strong models fell below 5%. These results suggest that scientific literature discovery is a distinct and much harder frontier for AI than general web searching. The primary bottlenecks identified include weak scientific reasoning, difficulty managing long, multi-part queries, and an inability to effectively use full-text information to verify findings.
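For readers unfamiliar with the Wide Research metric, IoU compares the agent's predicted set of papers against the gold set: the size of their intersection divided by the size of their union. A minimal sketch (the function name and paper IDs are illustrative):

```python
def iou(predicted, gold):
    """Intersection over Union between predicted and gold sets of paper IDs."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty: perfect agreement by convention
    return len(predicted & gold) / len(predicted | gold)

# Two shared papers out of four distinct papers overall -> 0.5
print(iou({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))  # 0.5
```

Because IoU penalizes both missing papers and spurious ones, an agent cannot game it by over-retrieving or by returning only a few safe answers.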