DeepWeb-Bench: A Deep Research Benchmark Demanding...

Key Takeaways

  • Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models.
  • Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone.
  • We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier.
  • Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation.
Paper Abstract

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.

What it covers

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Sixiong Xie∗, Zhuofan Shi∗, Haiyang Shen∗,†, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing†, Yun Ma†

Peking University
{xsx1001, shizhuofan, hyshen}@stu.pku.edu.cn, {jingxiang, mayun}@pku.edu.cn

Project page: https://sixiongxie1001-dot.github.io/deep-research-benchmark2.0

∗ Equal contribution. † Corresponding authors.

Abstract

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence.
We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12–14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models’ errors dominated by incomplete derivation and weak models’ by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only ρ = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.

1 Introduction

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, has emerged as a prominent use case for frontier language models. It is currently deployed along two tracks: vertically integrated commercial products such as OpenAI Deep Research (34), Claude Research (3), Gemini Deep Research (15), Perplexity Research-Pro (37), and Kimi Researcher (31); and command-line coding-agent harnesses such as Claude Code (2) and Codex (33), which practitioners increasingly pair with frontier backbones for open-web research. A benchmark useful for the current frontier needs to produce discriminative signal across both tracks.

Benchmarks for web-based information tasks have risen in difficulty over successive waves, from single-fact web question answering (43; 29) through multi-page evidence assembly (44; 57) to long-horizon deep research (13; 17; 56; 45; 11). Each wave represents a substantial increase in difficulty, yet frontier deep research products are now reported to score strongly on them (34; 3). The available benchmarks therefore do not provide sufficient discriminative headroom for the current generation of agents.
What makes a real deep research task hard is not that any individual fact is hidden, but that a defensible answer requires working with a large body of evidence at once. A financial analyst comparing several chip vendors, for example, consults regulatory filings, trade-press articles, industry-research notes, and earnings transcripts, holds many numbers in working memory, reconciles them when sources disagree, and composes a final figure through several layers of arithmetic and modeling assumptions. We target difficulty from three properties of the data itself: massive evidence collection rather than a handful of pages, cross-source reconciliation rather than single-source lookup, and long-horizon multi-step derivation rather than a single extraction step. We measure these properties through four capability families: Retrieval captures the evidence-collection baseline, Derivation and Reasoning capture multi-step composition under different analytical modes, and Calibration captures cross-source reconciliation and the ability to abstain when evidence is absent. Existing benchmarks typically address only a subset of these properties within a single task, whereas the three together, realized across all four families, are what distinguish real-world deep research workloads from short browsing questions or single-item expert questions.

We introduce DeepWeb-Bench, a deep research benchmark that targets all three properties within every task and is substantially harder than existing benchmarks for the current generation of deep research agents. Each task asks an agent to produce, for a single subject domain, a broad set of quantitative analytical conclusions; most conclusions require evidence drawn from multiple authoritative documents and composed through multi-step derivation rather than retrieved from a single page.
To enable automatic grading at this scale, the conclusions are presented as a matrix of entities against analytical dimensions organized into the four capability families, so that a task’s score decomposes into interpretable per-cell and per-family signals. Every reference answer is accompanied by a source-provenance record in which each supporting source is assigned one of four disclosure-based levels and cross-checked where possible. Each cell is scored by an explicit per-cell rule rather than a free-form judge, making the score easier to audit against the underlying evidence.

We evaluate DeepWeb-Bench on nine frontier models, including Codex CLI (the Codex command-line interface) (33) + GPT-5.5 and eight models hosted through Claude Code CLI (the Claude Code command-line interface) (2): Claude Opus 4.7 (4), Claude Sonnet 4.6 (5), DeepSeek V4 Pro and Flash (12), GLM 5.1 (53), Qwen 3.6 Plus (38), MiniMax M2.7 (30), and Kimi K2.6 (32). Native search and browsing tools in both command-line hosts are disabled; every model uses the benchmark-provided search, page-visit, and PDF-fetch tools. The strongest model attains 33.37% and the weakest 16.79%. Three findings emerge: (1) retrieval is not the bottleneck, as retrieval failures account for only 12–14% of errors while Derivation and Calibration failures exceed 70%; (2) strong and weak models fail in qualitatively different ways, with incomplete derivation dominating for strong models (31%) and hallucinated precision for weak models (38%); (3) models exhibit genuine specialization (ρ = 0.61, per-case disagreement up to 18.8 percentage points).

Our contributions are threefold:

• We introduce DeepWeb-Bench, a deep research benchmark substantially harder than existing benchmarks, because each task jointly demands massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation.

• We pair every reference answer with a four-level source-provenance record and cross-source verification where available, making scores auditable against the underlying evidence.

• We evaluate nine frontier models and present analysis organized around four capability families, complemented by failure-mode annotation and case studies.

2 Related Work

Benchmarks for web-based information tasks can be organized along a spectrum of task complexity, from single-step web search through multi-step deep search to full deep research. We review each in turn.

Web search and browsing question answering. The earliest wave of benchmarks evaluates agents on single-fact retrieval or short browsing trajectories. SimpleQA (43) targets single-hop factoid recall, GAIA (29) covers general browsing-agent tasks, and WebWalker (46) and Mind2Web (10) extend the setting with longer trajectories and interactive web navigation. These benchmarks established much of the methodology for evaluating browsing agents, but headline scores on them are reported to be largely saturated by current deep research products (34; 3). BrowseComp (44) and BrowseComp-ZH (57) raise the bar by requiring answers to be assembled from many web pages rather than read from a single source, but they remain focused on short-answer questions whose answers, once found, require no further derivation.

Deep search. A more recent line of work evaluates tasks that require multi-step evidence gathering and aggregation, going beyond single-page retrieval but stopping short of the long-horizon quantitative derivation that characterizes deep research. DeepSearchQA (17) targets comprehensiveness in multi-step information-seeking across 17 fields. DRACO (56) evaluates accuracy, completeness, and objectivity across 10 domains using expert-crafted rubrics. WideSearch (45) benchmarks broad information-seeking in which an agent populates a structured table from web sources. DRBench (1) extends this paradigm to enterprise settings, LiveResearchBench (41) provides live user-centric evaluation, and DeepResearchGym (11) offers a reproducible sandbox on frozen corpora.
Mind2Web 2 (16) evaluates 130 long-horizon agentic search tasks with an agent-as-a-judge framework. These benchmarks advance task complexity substantially, yet they typically grade the completeness or correctness of retrieved evidence rather than requiring the agent to compose a quantitative conclusion through multi-step derivation.

Deep research. At the far end of the spectrum, deep research tasks require not only extensive evidence collection and cross-source reconciliation but also long-horizon multi-step derivation in which the agent must compose retrieved numbers into a final quantitative answer through explicit arithmetic and modeling assumptions. DeepResearch Bench (13) and DeepResearch Bench II (21) evaluate deep research agents across broad sets of research tasks requiring long-horizon browsing and multi-step synthesis. DeepResearch-9K (47) provides 9,000 multi-hop questions at three difficulty levels with search trajectories. OpenResearcher (24) builds an open offline pipeline for synthesizing long-horizon trajectories. AutoResearchBench (51) benchmarks scientific literature discovery where even the strongest models achieve below 10%. Adjacent work covers expert-level academic questions (14; 39), scientific-literature synthesis (6), domain-specific analytical workflows (8; 20; 54), broader agentic evaluation (26; 48; 19; 49; 42; 28), and benchmark construction (23; 22; 55; 7). Several of these benchmarks were difficult at introduction, yet frontier deep research products are now reported to score strongly on them as well (34; 3).

DeepWeb-Bench sits at the deep research end of this spectrum and differs from prior work along two axes.
On the difficulty side, each task is structured so that a complete answer jointly requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation, rather than any one of these in isolation, which raises the difficulty above prior benchmarks for the current frontier. On the auditability side, every reference answer is paired with a four-level source-provenance record and cross-source checks where available, so that any evaluator can inspect a reported score against the underlying evidence.

3 DeepWeb-Bench

DeepWeb-Bench is organized around task matrices, cell-level reference records, and a shared evaluation protocol. A task covers one subject domain and asks an agent to complete an 8 × 8 matrix of quantitative research answers; each cell is graded independently against a reference answer, a derivation record, and supporting sources.

3.1 Task Format and Capability Coverage

Figure 1: Overview of DeepWeb-Bench. (a) Each task is an 8 × 8 matrix of entities against research dimensions; every cell is scored independently using a four-tier rubric ({1, 0.5, 0.25, 0}) and carries a reference answer with source-provenance labels and cross-source agreement. (b) The dimension axis covers four capability families, and every task spans multiple families.

A task in DeepWeb-Bench is a deep research assignment over a single subject domain, in which an agent is asked to produce a broad set of quantitative analytical conclusions about that domain, rather than a single short answer. Each such conclusion is an answer to a specific analytical question about a specific object of study. For the artificial-intelligence accelerator industry, for example, the object of study is a particular company and one analytical question is "what is this company’s artificial-intelligence-related revenue share and how did it change last year?"; for the Chinese new-energy vehicle market, one analytical question about a particular automaker is "what is this company’s per-vehicle gross profit?" or "what is this company’s exposure to the European Union anti-subsidy tariff?" Producing such a conclusion requires reading many authoritative documents, reconciling values that disagree, and composing the result through multi-step derivation; producing the full set of conclusions amplifies this load, because the same body of evidence feeds many conclusions and the agent has to work with it coherently across the whole task.

The task interface is a structured matrix. Rows are comparable entities within the same subject domain, columns are analytical dimensions, and each (entity, dimension) pair is a cell with a quantitative answer (Figure 1a). This format fixes the scope of the task and the atomic unit of grading: a task score is the mean of independent cell scores rather than a single holistic judgment on a long report. Each cell’s reference answer is either a precise value, a range estimate with a stated confidence and derivation method, or an explicit "not available" marker when the quantity is not disclosed by an authoritative source. Each cell also carries supporting sources, a source-provenance label for each source, and a marker indicating whether the sources agree, disagree, or provide only a single independent view. The provenance labels have four disclosure-based levels: T1 for primary filings, official disclosures, and final regulatory rules; T2 for methodology-published research and formal industry or statistical datasets; T3 for reputable media and sell-side research; and T4 for informal or unverified sources. These labels are a record of disclosure provenance rather than a learned quality score.

Capability coverage. The dimension axis spans four capability families.
Retrieval-type dimensions request a value directly disclosed by a primary document and establish the evidence-collection baseline: they test whether an agent can locate and read the authoritative source. Derivation-type dimensions, including chain derivation, cross-column comparison, and sum-of-the-parts decomposition, request a value that usually is not directly stated as the final answer and must be composed from multiple disclosed numbers through an explicit multi-step computation. Reasoning-type dimensions, including scenario reasoning and quantitative extrapolation, request the quantitative outcome under a stated counterfactual or forward trajectory; they cover cases where the agent must carry a model through to a quantitative answer rather than combine disclosed numbers. Calibration-type dimensions, including cross-source conflict resolution and hallucination resistance, probe how an agent responds when sources disagree or when no authoritative source supports a precise answer. Only a small share of cells in any task fall in the Retrieval family; the majority are Derivation, Reasoning, or Calibration cells, which pushes difficulty beyond retrieval.

This taxonomy follows the separation in recent evaluation work between factual retrieval and factuality (43), browsing or web traversal (29; 46; 10), interactive tool use (36; 50), and long-horizon agent tasks (26; 48). DeepWeb-Bench adapts these distinctions to quantitative deep research by making derivation, source reconciliation, and calibrated abstention explicit per-cell capabilities rather than implicit properties of a whole answer.

Entity-axis properties. The entity axis contains comparable entities from the same market segment. A task may include leading firms, smaller firms, vertically integrated firms, and firms with partial disclosure, as long as the comparison remains meaningful for the domain.
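The cell-level reference record described in Section 3.1 can be pictured as a small data structure. The following is a minimal sketch, not the benchmark's actual schema: the class names, field names, and the example URL are hypothetical, while the four capability families, the T1 to T4 provenance levels, the three answer kinds, and the agreement markers are taken from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Family(Enum):
    RETRIEVAL = "Retrieval"
    DERIVATION = "Derivation"
    REASONING = "Reasoning"
    CALIBRATION = "Calibration"


class Tier(Enum):
    T1 = "primary filings, official disclosures, final regulatory rules"
    T2 = "methodology-published research, formal industry or statistical datasets"
    T3 = "reputable media and sell-side research"
    T4 = "informal or unverified sources"


class Agreement(Enum):
    CONSISTENT = "consistent"
    DIVERGENT = "divergent"
    SINGLE = "single"


@dataclass(frozen=True)
class Source:
    url: str    # hypothetical field name
    tier: Tier  # disclosure-provenance label, not a quality score


@dataclass
class ReferenceCell:
    """One (entity, dimension) cell of a task matrix (hypothetical layout)."""
    entity: str
    dimension: str
    family: Family
    agreement: Agreement
    sources: List[Source] = field(default_factory=list)
    value: Optional[float] = None                      # precise reference value
    value_range: Optional[Tuple[float, float]] = None  # range estimate

    @property
    def kind(self) -> str:
        """Return which of the three reference-answer kinds this cell holds."""
        if self.value is not None:
            return "precise"
        if self.value_range is not None:
            return "range"
        return "not_available"


if __name__ == "__main__":
    cell = ReferenceCell(
        entity="Example Automaker",  # hypothetical entity
        dimension="per-vehicle gross profit",
        family=Family.DERIVATION,
        agreement=Agreement.SINGLE,
        sources=[Source("https://example.com/annual-report", Tier.T1)],
    )
    print(cell.kind)
```

Leaving both `value` and `value_range` unset is how this sketch encodes a cell whose reference answer is "not available"; a real schema might use an explicit marker instead.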
Variation in disclosure is part of the task state: on some dimensions a calibrated agent should return "not available" for entities whose public filings do not support a precise answer, rather than forcing a number for every row.

3.2 Task Requirements and Construction

A benchmark task that separates frontier deep research agents from simpler retrieval pipelines should reduce the chance that a lucky search query, a single authoritative page, or a narrow skill is enough for a high score. DeepWeb-Bench applies three checks. First, retrieval-only cells are limited, and non-retrieval cells generally require multiple sources plus computation, reasoning, or synthesis; their final values are usually not available as directly stated answers. Second, reference answers are grounded in source records that prioritize T1 and T2 evidence and record cross-source agreement (consistent, divergent, or single) where independent public sources are available. Third, the dimension set spans retrieval, chain derivation, cross-source conflict identification, hallucination resistance, scenario reasoning, and quantitative extrapolation, so that the task score does not reduce to a single skill. Domain experts build tasks by choosing a domain, curating a comparable entity set with meaningful disclosure variation, writing research dimensions that follow the capability constraints, and creating reference answers with derivation chains, source-provenance labels, and cross-source checks.

3.3 Dataset Statistics

Figure 2: Dataset statistics for the 100-task release. (a) Capability-family distribution over the eight dimensions in each task: 1 Retrieval, 4 Derivation, 1 Calibration, and 2 Reasoning dimensions. (b) Number of tasks in each industry category. (c) Matrix scale: every task contains 8 entities, 8 dimensions, and therefore 64 independently scored cells. (d) Reference-record density: the average number of cited source URLs per cell, independent publishers per cell, and derivation steps per derivation-type cell.

DeepWeb-Bench currently comprises 100 tasks spanning six domain categories: Technology (25), Energy & Materials (20), Industrials & Transport (18), Consumer (16), Finance (12), and Healthcare & Pharma (9). Each task targets a distinct industry segment and uses an 8 × 8 matrix, yielding 100 × 8 × 8 = 6,400 atomic scoring cells. The 8 dimensions in every task follow a fixed capability-family split: 1 Retrieval, 4 Derivation, 1 Calibration, and 2 Reasoning.

Figure 2 summarizes the benchmark along four dataset-only axes. Panel (a) reports the capability-family distribution: Derivation accounts for 50% of dimensions per task (4 of 8), Reasoning 25%, and Calibration and Retrieval each 12.5%, so the majority of cells require multi-step synthesis rather than direct lookup. Panel (b) reports the domain distribution across six industry categories. Panel (c) reports the task-matrix scale, and Panel (d) reports reference-record density: reference derivations cite on average 3.2 distinct URLs per cell and 22 per task, span 2.4 independent publishers per cell, and require 2.8 derivation steps per derivation-type cell.

3.4 Evaluation Protocol

Each evaluated agent is placed in an isolated session per task, receiving the entity list, the dimension list, and the output-format specification. For each dimension, the prompt includes a natural-language question and a metric specification that states the requested quantity, unit, and answer format. No reference answer, source hint, or scoring rule is disclosed. All agents access the same three benchmark browsing tools: web_search for retrieving candidate pages, page_visit for reading web pages, and pdf_fetch for downloading and reading PDF documents.
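The dataset arithmetic in Section 3.3 can be reproduced in a few lines. This is only a bookkeeping sketch: the variable names are hypothetical, and all counts are the ones stated in the text.

```python
# Counts stated in Section 3.3; variable names are hypothetical.
TASKS = 100
ENTITIES_PER_TASK = 8
DIMENSIONS_PER_FAMILY = {
    "Retrieval": 1,
    "Derivation": 4,
    "Calibration": 1,
    "Reasoning": 2,
}
TASKS_PER_DOMAIN = {
    "Technology": 25,
    "Energy & Materials": 20,
    "Industrials & Transport": 18,
    "Consumer": 16,
    "Finance": 12,
    "Healthcare & Pharma": 9,
}

# Every task is an entities-by-dimensions matrix.
dimensions_per_task = sum(DIMENSIONS_PER_FAMILY.values())
cells_per_task = ENTITIES_PER_TASK * dimensions_per_task
total_cells = TASKS * cells_per_task

# Share of dimensions (and therefore of cells) belonging to each family.
family_share = {
    family: count / dimensions_per_task
    for family, count in DIMENSIONS_PER_FAMILY.items()
}

# Number of released cells belonging to each family.
cells_per_family = {
    family: TASKS * ENTITIES_PER_TASK * count
    for family, count in DIMENSIONS_PER_FAMILY.items()
}

if __name__ == "__main__":
    print("cells per task:", cells_per_task)
    print("total cells:", total_cells)
    for family in DIMENSIONS_PER_FAMILY:
        print(family, f"{family_share[family]:.1%}", cells_per_family[family])
```

Because the family split is fixed per task, the per-family share of cells equals the per-family share of dimensions.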
The budget is 200 tool calls and 30 minutes wall-clock per task; cells without an answer at budget exhaustion score 0. Native search and browsing tools in the host command-line interfaces are disabled.

Each cell is scored by a fixed four-tier rubric ({1, 0.5, 0.25, 0}): 1 for a value and derivation within tolerance, 0.5 for a partially correct answer (e.g., the correct direction of change but a value outside tolerance, or a correct range without a derivation), 0.25 for a marginally relevant attempt, and 0 otherwise. On "not available" cells, any precise value scores 0; an explicit "not available" with justification scores 1; a wide range with an estimation method scores 0.5. The task score is the mean of cell scores; the benchmark score is the mean across tasks. Scoring is performed by an automated GPT-5.5 grader applying the per-cell rubric; a human validation on a stratified sample of 200 cells yields κ = 0.82 agreement with the automated grader.

4 Experiments

We evaluate nine frontier model configurations on the 100-task release of DeepWeb-Bench, report aggregate and capability-family scores, and analyze the main failure patterns.

4.1 Evaluation Setup

We evaluate DeepWeb-Bench on nine frontier backbone models, covering the major model families that underlie current deep research products and command-line agent harnesses. Eight models are hosted through Claude Code CLI (2): Claude Opus 4.7 (4), Claude Sonnet 4.6 (5), DeepSeek V4 Pro and DeepSeek V4 Flash (12), GLM 5.1 (53), Kimi K2.6 (32), Qwen 3.6 Plus (38), and MiniMax M2.7 (30). The ninth configuration, Codex CLI (33) + GPT-5.5 (35), accesses GPT-5.5 through the Codex CLI harness. Native search and browsing tools in both hosts are disabled. Every model receives only the benchmark-provided web_search, page_visit, and pdf_fetch tools, with the same per-task budget.
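The score aggregation described in Section 3.4 can be sketched as follows. This is an illustrative sketch rather than the released evaluation code: the function names and tier labels are hypothetical, the tier for an ordinary cell is assumed to come from the grader, and the handling of cases the text does not specify (for example, an unjustified "not available") is an assumption flagged in the comments.

```python
from statistics import mean
from typing import Sequence

# Four-tier rubric values from the protocol; the tier labels are hypothetical.
RUBRIC = {
    "within_tolerance": 1.0,
    "partially_correct": 0.5,
    "marginally_relevant": 0.25,
    "otherwise": 0.0,
}


def score_not_available_cell(answer_kind: str, has_support: bool) -> float:
    """Score a response on a cell whose reference answer is "not available".

    answer_kind is "precise", "not_available", or "range"; has_support says
    whether a justification (for "not_available") or an estimation method
    (for "range") accompanies the response.
    """
    if answer_kind == "precise":
        return 0.0  # any precise value scores 0
    if answer_kind == "not_available" and has_support:
        return 1.0  # explicit abstention with justification
    if answer_kind == "range" and has_support:
        return 0.5  # wide range with an estimation method
    # Assumption: the text does not cover unsupported abstentions or ranges.
    return 0.0


def task_score(cell_scores: Sequence[float], cells_per_task: int = 64) -> float:
    """Mean of cell scores; cells left unanswered at budget exhaustion count as 0."""
    if len(cell_scores) > cells_per_task:
        raise ValueError("more cell scores than cells in the task matrix")
    missing = cells_per_task - len(cell_scores)
    return mean(list(cell_scores) + [0.0] * missing)


def benchmark_score(task_scores: Sequence[float]) -> float:
    """Mean of task scores across tasks."""
    return mean(task_scores)


if __name__ == "__main__":
    # Hypothetical task: 16 cells within tolerance, 16 partially correct,
    # and 32 cells left unanswered.
    answered = [RUBRIC["within_tolerance"]] * 16 + [RUBRIC["partially_correct"]] * 16
    print(task_score(answered))
```

Padding unanswered cells with zeros keeps the denominator fixed at 64 cells per task, which matches the rule that cells without an answer at budget exhaustion score 0.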
Each (model, task) session is subject to the standard budget of 200 web fetches and 30 minutes of wall-clock time. Scoring is performed by an independent GPT-5.5 grader using a four-tier rubric ({1, 0.5, 0.25, 0}) applied per cell. The primary metric is the average task score, defined as the mean of cell rubric scores averaged across tasks. Of 900 model-task pairs (9 models × 100 tasks), 874 produced valid responses and were scored; the remaining 26 pairs are listed in the full result tables.

4.2 Main Results

Table 1: Main results on DeepWeb-Bench. All numbers are percentages. Overall score is the mean task score across the scored tasks for that model. The four capability-family columns report the mean score over cells in that family: Retrieval, Derivation, Calibration, and Reasoning. Minimum task score and Maximum task score are the lowest and highest task-level scores for the model. The release contains 6,400 cells; 874/900 model-task pairs were scored, yielding 55,936 scored cells.

| Model | Overall score | Retrieval | Derivation | Calibration | Reasoning | Minimum task | Maximum task |
| Codex CLI + GPT-5.5 | 33.37 | 37.84 | 32.55 | 34.16 | 32.38 | 16.80 | 92.19 |
| Claude Opus 4.7 | 31.84 | 36.52 | 30.97 | 31.14 | 31.59 | 3.91 | 85.94 |
| DeepSeek V4 Pro | 28.68 | 32.89 | 27.73 | 29.77 | 27.94 | 0.00 | 82.03 |
| GLM 5.1 | 28.18 | 34.19 | 27.06 | 29.56 | 26.70 | 15.63 | 84.38 |
| Claude Sonnet 4.6 | 27.97 | 33.80 | 26.87 | 28.89 | 26.80 | 7.42 | 83.59 |
| DeepSeek V4 Flash | 27.73 | 33.72 | 26.77 | 28.39 | 26.37 | 1.17 | 82.03 |
| Qwen 3.6 Plus | 26.54 | 32.25 | 25.34 | 27.00 | 25.84 | 11.72 | 83.20 |
| MiniMax M2.7 | 24.06 | 28.56 | 22.94 | 24.69 | 23.70 | 1.17 | 76.56 |
| Kimi K2.6 | 16.79 | 26.21 | 15.36 | 16.39 | 15.13 | 1.56 | 89.84 |
| Mean | 27.17 | 32.83 | 26.10 | 27.73 | 26.19 | – | – |

Table 1 reports per-model results across 100 tasks. The Overall score column is the primary metric. The four capability-family columns show where each model’s score comes from, and the Minimum task and Maximum task columns report the lowest and highest per-task scores.

Overall performance.
Codex CLI + GPT-5.5 is the strongest model, with an overall score of 33.37%. Among models hosted through Claude Code CLI, Claude Opus 4.7 leads at 31.84%, followed by a middle cluster between 26% and 29%: DeepSeek V4 Pro, GLM 5.1, Claude Sonnet 4.6, DeepSeek V4 Flash, and Qwen 3.6 Plus. MiniMax M2.7 scores 24.06%, while Kimi K2.6 trails at 16.79%. The cross-model mean is 27.17%, and the 16.58-point gap between the strongest and weakest models leaves substantial headroom for current frontier agents. The Minimum task and Maximum task columns show that every model has wide per-task variation: even low-average models occasionally perform well on high-disclosure domains.

4.3 Fine-Grained Analysis

Figure 3: Fine-grained score variation across the 100 tasks. (a) Pairwise Spearman rank correlation between models’ task-level scores; the mean correlation is ρ = 0.61, showing that models do not fail on exactly the same tasks. Model labels are abbreviated only inside the plot to save space. (b) Distribution of task-level scores for each model, where each box plot summarizes the 100 per-task scores for that model.

Table 2: Human-labeled failure modes in 500 failing cells.

| Failure mode | Top four models | Other five models |
| Hallucinated precision | 22% | 38% |
| Silent source choice | 18% | 14% |
| Incomplete derivation | 31% | 24% |
| Scope drift | 15% | 12% |
| Retrieval gap | 14% | 12% |

Finding 1: Retrieval is not the bottleneck. To identify where cell-level score gaps originate, two annotators classify 500 failing cells into five failure modes (Table 2). Retrieval gap, in which the agent fails to locate a publicly indexable authoritative source, accounts for only 12–14% of failures in both strong and weak model groups.
By contrast, modes that correspond to the Derivation and Calibration families together account for over 70% of failures: incomplete derivation (locating the right inputs but erring in the composition step) and hallucinated precision (committing to a precise value when the ground truth is "not available" or a range). The dominant difficulty is therefore multi-step composition and cross-source reconciliation rather than evidence collection alone.

The capability columns in Table 1 show the same pattern at the aggregate level. Retrieval is the smallest slice (800 released cells) and has the highest aggregate score at 32.83%. The three non-retrieval families account for 87.5% of cells and score lower: 26.10% for Derivation, 27.73% for Calibration, and 26.19% for Reasoning. This gap shows that the main loss comes after initial evidence access, especially in numerical composition, reconciliation, and calibrated abstention.

Finding 2: Strong and weak models fail in qualitatively different ways. Table 2 groups the 500 annotated failures by model strength (top-4 models versus the remaining five). The dominant failure mode shifts across groups. In the top-4 group, incomplete derivation accounts for 31% of failures: strong models retrieve the correct intermediate values but misapply a composition step (e.g., applying a gross-margin rate to total revenue instead of segment revenue). In the remaining-model group, hallucinated precision rises to 38%: weaker models commit to a confident precise number even when no authoritative source supports it, which the scoring rule penalizes to zero. This shift indicates a qualitative phase transition in failure modes as model capability increases, not merely a scalar reduction in error rate, and it suggests that improving derivation accuracy and improving calibration require different training interventions.

This pattern also clarifies how the overall score should be read.
A model can improve its Retrieval score by finding primary documents more reliably, but that improvement affects only one dimension per task. By contrast, Derivation and Reasoning together cover six dimensions per task and dominate the benchmark score. The strongest two models are not merely better retrievers: their largest advantage over the middle cluster appears in Derivation and Calibration, where they more often preserve scope, carry intermediate quantities through a computation, and abstain when disclosure is insufficient. This is why the main table reports capability columns alongside the overall score rather than treating them as a secondary diagnostic.

Finding 3: Models exhibit genuine specialization across domains. Figure 3a reports pairwise Spearman ρ across 100 case scores. The mean is ρ = 0.61, with no pair exceeding ρ = 0.79, indicating that models make genuinely different errors rather than failing on a shared set of hard cases. The top-2 models (Codex CLI + GPT-5.5 and Claude Opus 4.7) together achieve the highest score on 79 of 100 cases, but on the most-disagreed cases the cross-model standard deviation reaches 18.8 percentage points, meaning that a model ranked near the top on one domain can rank near the bottom on another. Per-case cross-model averages range from 14.67% (the hardest case, mortgage REITs) to 83.07% (the easiest, luxury goods), with the hardest cases concentrated in domains requiring reconciliation of non-standardized financial disclosures and the easiest in domains where primary filings are abundant and uniform.

Case studies. We present two case studies that illustrate the two dominant failure modes and show how the per-cell
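The two agreement statistics used in Finding 3, pairwise Spearman rank correlation over per-task scores and the cross-model standard deviation on a single case, can be sketched as below. This is a pure-Python illustration under stated assumptions: the function names and the example scores are hypothetical, ties are given average ranks, and the population standard deviation is used because the text does not say which estimator the authors apply.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cross_model_std(scores_on_one_case: Sequence[float]) -> float:
    """Spread of model scores on one case (population standard deviation; an assumption)."""
    return pstdev(scores_on_one_case)


if __name__ == "__main__":
    # Hypothetical per-task scores (percent) for two models on five tasks.
    model_a = [40.0, 25.0, 60.0, 10.0, 33.0]
    model_b = [35.0, 30.0, 20.0, 12.0, 50.0]
    print("Spearman rho:", spearman(model_a, model_b))
    print("Std on task 3:", cross_model_std([model_a[2], model_b[2]]))
```

Averaging `spearman` over all model pairs would give a mean pairwise correlation of the kind reported in Figure 3a.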
