This paper investigates whether AI search agents genuinely perform research or simply use the web to confirm information they already "know" from their training data. The researchers identify a phenomenon called Intrinsic Knowledge Dependence (IKD), in which agents rely on their internal memory to answer questions rather than actively discovering new information. To address this, the authors introduce a new benchmark, LiveBrowseComp, designed to test whether agents can find facts that are not yet part of their internal knowledge.
The Problem: Memory-Backed Verification
The researchers conducted three diagnostic tests on existing search benchmarks, such as BrowseComp. They discovered that agents often answer a significant percentage of questions without using any search tools at all. Furthermore, when the researchers blocked the agents from accessing the specific documents needed to answer a question, the agents performed worse than if they had not searched at all. This suggests that instead of using the web to learn, these agents are using it as a "confirmation interface" to verify hypotheses they have already generated from their own internal memory.
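The comparison behind these diagnostics can be illustrated with a minimal sketch. It assumes each question has been graded under three conditions (no tools, unrestricted search, and search with the answer-bearing documents blocked). The `Outcome` record, the `ikd_report` helper, and the sample values are hypothetical and purely illustrative; they are not the authors' code or data.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Graded result for one question under three hypothetical conditions."""
    question_id: str
    closed_book: bool       # correct with no search tools
    with_search: bool       # correct with unrestricted search
    evidence_blocked: bool  # correct with search, but key documents blocked


def accuracy(outcomes, field):
    """Fraction of questions answered correctly under the given condition."""
    if not outcomes:
        return 0.0
    return sum(getattr(o, field) for o in outcomes) / len(outcomes)


def ikd_report(outcomes):
    """Summarize the three conditions and flag the memory-reliance pattern."""
    closed = accuracy(outcomes, "closed_book")
    searched = accuracy(outcomes, "with_search")
    blocked = accuracy(outcomes, "evidence_blocked")
    return {
        "closed_book": closed,
        "with_search": searched,
        "evidence_blocked": blocked,
        # True when blocked search does worse than not searching at all.
        "blocked_below_closed_book": blocked < closed,
    }


# Toy data, invented for illustration only.
sample = [
    Outcome("q1", closed_book=True, with_search=True, evidence_blocked=False),
    Outcome("q2", closed_book=True, with_search=True, evidence_blocked=True),
    Outcome("q3", closed_book=False, with_search=True, evidence_blocked=False),
    Outcome("q4", closed_book=False, with_search=False, evidence_blocked=False),
]
report = ikd_report(sample)
print(report)
```

A high closed-book accuracy, combined with blocked-search accuracy that falls below it, is the pattern the summary describes as using search for confirmation rather than discovery.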
Introducing LiveBrowseComp
To move beyond this reliance on static memory, the authors created LiveBrowseComp. This benchmark consists of 335 human-authored questions that require multi-step reasoning and, crucially, rely on facts published within the 90 days prior to the benchmark's creation. By focusing on very recent, obscure, and non-salient events, the researchers ensured that the answers could not be found in the models' pre-existing training data. Each question was rigorously verified by humans to ensure it was solvable through search but impossible to answer using only historical knowledge.
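The 90-day recency criterion can be sketched as a simple date check. The function names, the inclusive boundaries, and the rule that a question qualifies when at least one of its required facts falls inside the window are assumptions made for illustration; the summary does not specify how the authors applied the criterion, and the dates below are invented.

```python
from datetime import date, timedelta

RECENCY_WINDOW_DAYS = 90  # window length taken from the description above


def within_recency_window(fact_published, benchmark_created,
                          window_days=RECENCY_WINDOW_DAYS):
    """True if the fact was published in the window before benchmark creation.

    Both boundaries are treated as inclusive (an assumption).
    """
    earliest = benchmark_created - timedelta(days=window_days)
    return earliest <= fact_published <= benchmark_created


def question_is_fresh(fact_dates, benchmark_created):
    """Assumed rule: at least one required fact must be recent."""
    return any(within_recency_window(d, benchmark_created) for d in fact_dates)


# Hypothetical dates, for illustration only.
created = date(2025, 6, 1)
print(within_recency_window(date(2025, 4, 15), created))
print(question_is_fresh([date(2019, 1, 1), date(2025, 5, 20)], created))
print(question_is_fresh([date(2019, 1, 1)], created))
```

Requiring only one recent fact reflects the idea that a single missing link is enough to make a multi-step question unanswerable from memory alone.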
Key Findings
The results from LiveBrowseComp reveal a significant gap in current AI capabilities. On these time-sensitive, "long-tail" questions, the closed-book accuracy of every evaluated model fell below 2%. Even with full access to search tools, the models scored 25–40 points lower than they did on traditional benchmarks. The study concludes that current search benchmarks often conflate a model's ability to memorize facts with its ability to perform genuine, evidence-driven discovery.
Implications for Future Research
The authors argue that as models become more knowledgeable, static benchmarks become increasingly unreliable because they reward memory rather than search skills. LiveBrowseComp serves as a tool to isolate true search capability. By forcing agents to operate outside their "knowledge boundary," the benchmark highlights that current agents struggle not because the tasks are inherently unsolvable, but because their reliance on internal verification fails when they encounter information they do not already possess.