LiveBrowseComp: Are Search Agents Searching, or Jus...

Key Takeaways

  • Are LLM-based search agents genuinely searching, or using the web to verify what they already know?
  • We study this question on BrowseComp with three diagnostics.
  • Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence.
  • These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find.
Paper Abstract

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at this https URL.

This paper investigates whether AI search agents are truly performing research or simply using the web to confirm information they already "know" from their training data. The researchers identify a phenomenon called Intrinsic Knowledge Dependence (IKD), where agents rely on their internal memory to answer questions rather than actively discovering new information. To address this, the authors introduce a new benchmark, LiveBrowseComp, designed to test an agent's ability to find facts that do not yet exist in their internal knowledge base.

The Problem: Memory-Backed Verification

The researchers ran three diagnostics on BrowseComp, an existing search benchmark. First, agents answered up to 44.5% of its questions with no search tools at all. Second, more than half of the agents' search queries were generated from internally produced hypotheses rather than from leads found in retrieved pages. Third, when answer-supporting evidence was removed, the agents performed worse than closed-book baselines, that is, worse than if they had not searched at all. Together, these results suggest that the agents use the web less to learn than as a "confirmation interface" for hypotheses they have already generated from internal memory.
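
The three diagnostics can be pictured as simple aggregates over per-question run logs. The following Python sketch is only an illustration under stated assumptions: the record layout (`QuestionRun`), its field names, and the function `ikd_diagnostics` are hypothetical and are not taken from the paper, which may define and measure these quantities differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QuestionRun:
    """Hypothetical per-question log; field names are assumptions."""
    closed_book_correct: bool        # correct with no tools at all
    evidence_removed_correct: bool   # correct with search, supporting evidence removed
    queries_from_hypothesis: int     # queries seeded by the model's own guess
    queries_from_retrieved: int      # queries seeded by a retrieved lead


def ikd_diagnostics(runs: List[QuestionRun]) -> Dict[str, float]:
    """Aggregate the three diagnostics over a non-empty list of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    closed_book = sum(r.closed_book_correct for r in runs) / n
    evidence_removed = sum(r.evidence_removed_correct for r in runs) / n
    hypothesis_queries = sum(r.queries_from_hypothesis for r in runs)
    total_queries = hypothesis_queries + sum(r.queries_from_retrieved for r in runs)
    return {
        # Diagnostic 1: share of questions answered without tools.
        "closed_book_accuracy": closed_book,
        # Diagnostic 2: share of queries that start from an internal hypothesis.
        "hypothesis_query_share": (
            hypothesis_queries / total_queries if total_queries else 0.0
        ),
        # Diagnostic 3: accuracy once answer-supporting evidence is removed.
        "evidence_removed_accuracy": evidence_removed,
        # Negative gap: searching without evidence did worse than not searching.
        "removal_gap": evidence_removed - closed_book,
    }
```

Under this framing, a high `closed_book_accuracy`, a `hypothesis_query_share` above 0.5, and a negative `removal_gap` would together correspond to the pattern the summary describes as Intrinsic Knowledge Dependence.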

Introducing LiveBrowseComp

To move beyond this reliance on static memory, the authors created LiveBrowseComp. The benchmark consists of 335 human-authored questions that require multi-step reasoning and, crucially, depend on facts published within the 90 days preceding the benchmark's construction. The facts are drawn from six updated sources and filtered to exclude globally salient events, so the answers are unlikely to appear in the models' pre-existing training data. Each question was verified by humans to confirm that it is solvable through search but not from historical knowledge alone.
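
The recency and salience criteria described above amount to a filter over candidate questions. The sketch below shows one way such a filter could look. The `Candidate` record, the `globally_salient` flag (assumed to be set by a human reviewer), and the inclusive 90-day boundary are all assumptions made for illustration; the authors' actual pipeline is not described in this summary.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Candidate:
    """Hypothetical candidate question; field names are assumptions."""
    question: str
    fact_published: date      # when the answer-bearing fact was published
    globally_salient: bool    # assumed to be judged by a human reviewer


def is_eligible(c: Candidate, built_on: date, window_days: int = 90) -> bool:
    """True if the fact falls inside the window and is not globally salient."""
    age = built_on - c.fact_published
    # Inclusive boundary is an assumption; facts dated after construction are rejected.
    in_window = timedelta(0) <= age <= timedelta(days=window_days)
    return in_window and not c.globally_salient


def filter_candidates(
    candidates: List[Candidate], built_on: date, window_days: int = 90
) -> List[Candidate]:
    """Keep only candidates that satisfy both criteria, preserving order."""
    return [c for c in candidates if is_eligible(c, built_on, window_days)]
```

A filter like this only enforces the date and salience conditions; it says nothing about whether a question is solvable through search, which the summary attributes to human verification.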

Key Findings

The results from LiveBrowseComp reveal a significant gap in current AI capabilities. On these time-sensitive, "long-tail" questions, every evaluated agent scored below 2% accuracy in the closed-book setting. Even with full access to search tools, scores dropped by 25–40 points relative to BrowseComp, and model rankings established on earlier benchmarks no longer reliably predicted performance. The study concludes that static search benchmarks often conflate a model's ability to recall memorized facts with its ability to perform genuine, evidence-driven discovery.
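
The abstract's observation that prior model rankings no longer reliably predict performance can be checked with a rank correlation between two benchmarks. The sketch below uses Spearman's rho as one common choice; this is an assumption, since the summary does not say which statistic, if any, the authors used. The function names are hypothetical, and the formula used here is valid only when no two models have tied scores.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by score, 1 = best. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman_rho(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Spearman rank correlation between two benchmarks over the same models."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both benchmarks must cover the same models")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two models")
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in scores_a)
    # Closed-form Spearman formula for untied ranks.
    return 1 - 6 * squared_diffs / (n * (n * n - 1))
```

A value near 1 would mean the older benchmark's ordering carries over to the new one, while a value near 0 would mean it carries almost no information about the new ordering.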

Implications for Future Research

The authors argue that as models become more knowledgeable, static benchmarks become increasingly unreliable because they reward memory rather than search skills. LiveBrowseComp serves as a tool to isolate true search capability. By forcing agents to operate outside their "knowledge boundary," the benchmark highlights that current agents struggle not because the tasks are inherently unsolvable, but because their reliance on internal verification fails when they encounter information they do not already possess.
