The Compressive Knowledge Graph Hypothesis: Which G... | AI Research

Key Takeaways

  • Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shape the generated hypotheses.
  • We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash.
  • We perturb local KGs by varying density, ontology richness, topology, and control structure, and evaluate outputs with both provided-graph and fixed-reference metrics.
  • Across models, KG utility is selective and model-dependent: graph context changes outputs, but no-KG outputs also recover substantial graph content from model priors.
Paper Abstract

Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shape the generated hypotheses. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying density, ontology richness, topology, and control structure, and evaluate outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes outputs, but no-KG outputs also recover substantial graph content from model priors. Compact top-k subgraphs often approximate full-KG behavior, including when claimed-outcome triples are held out. At the same time, compression is not unique to one semantic ranking rule: random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal is often recoverable from compact, scientifically structured subgraphs rather than requiring the full local graph.

What it covers

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

Shashwat Sourav 1,2,3,4, Viktoriia Baibakova 5, Sanjay Das 2, Ran Elgedawy 2, Maria Mahbub 2, Emily Herron 2, Tirthankar Ghosal 2

1 Washington University in St. Louis
2 Oak Ridge National Laboratory
3 Lawrence Berkeley National Laboratory
4 UniverseTBD
5 Lila Sciences

[email protected]

1 Introduction

Knowledge graphs are increasingly used as a way to give language models structured context (Pan et al., 2023; Ando and Zhang, 2005). In scientific settings, this is especially appealing (Xiong et al., 2024; Kulkarni et al., 2025). A graph can organize a problem into explicit concepts such as the material system, the failure mode, the proposed intervention, the mechanism, and the target property. In principle, this should help a model move from a vague answer to a more grounded scientific hypothesis (Baek et al., 2024). In practice, however, it is still unclear how much of that graph structure the model actually uses (Liu et al., 2023; Hagström et al., 2024). A model may benefit from a few salient entities, from the relation structure itself, or from only a small subset of the graph while ignoring the rest. If we do not separate these possibilities, it is difficult to know what role external knowledge graphs are really playing in hypothesis generation.

Current work often treats knowledge graph prompting as a single intervention: provide the graph, then measure whether performance goes up or down (Pan et al., 2023; Wen et al., 2023). That view is too coarse for scientific discovery settings. A graph can vary in several ways at once. It can be dense or sparse, coarse or semantically rich, shallow or multi-hop (Jin et al., 2024; Mavromatis and Karypis, 2024). It can also be partly corrupted, shuffled, or compressed into a targeted subgraph (Li et al., 2023). When model behavior changes under these conditions, the main question is not only whether the graph helps, but also which part of the graph helps, how that changes across models, and what kind of information remains useful as model capability increases (Zhang et al., 2022; Yasunaga et al., 2022).
In this work, we study these questions in a battery-science hypothesis-generation setting. We compare three models with different capability levels: Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash (Jiang et al., 2023; Grattafiori et al., 2024; Comanici et al., 2025). For each scientific problem, we generate hypotheses under a family of KG conditions that vary density, ontology richness, topology, and control structure (Jin et al., 2024; Mavromatis and Karypis, 2024; Ma et al., 2024). We then evaluate the generated outputs with metrics that separate entity use from relation use, and we add intervention-based analyses that test what information is necessary, sufficient, and causally important (Chen et al., 2024). This lets us move beyond the usual question of whether KG prompting works, and toward a more precise account of how models use external structured knowledge.

Our main claim is what we call the Compressive Knowledge Graph Hypothesis. We propose that full-KG behavior is often recoverable from compact subgraphs rather than requiring the entire local graph. This is a redundancy-aware claim, as it does not require one ranking rule to uniquely identify the important triples. Instead, useful signal may be distributed across mechanism, intervention, failure-mode, and outcome-facing relations, and different compact subsets may preserve enough structure to recover similar hypothesis behavior (Zhang et al., 2025; Long et al., 2025).

Across analyses, we find that KG influence is real but model-dependent. Problem identity remains the dominant source of variation, while KG condition has a smaller but measurable effect. Targeted and top-k subgraphs often recover much of the full-KG behavior; fixed-reference scoring shows that this is not merely an artifact of no-KG zeros; and an outcome-held-out control shows that compression is not only claimed-outcome leakage (Li et al., 2024; Linders and Tomczak, 2025).
These findings matter for how knowledge graphs should be used in hypothesis generation. If the useful signal is concentrated in a compact subset, then supplying larger and denser graphs may not be the right design choice, especially for stronger models. A better approach may be to identify the small set of graph facts that actually steers generation. This has consequences not only for knowledge-graph prompting, but also for how we think about retrieval, structured context design, and evaluation in scientific language-model systems (Gurrapu et al., 2023; Wang et al., 2025). Our contributions are as follows:

• We study knowledge-graph-guided hypothesis generation across three models and multiple graph manipulations, including density, ontology, topology, random, shuffled, targeted, and compressed graph conditions.

• We introduce an evaluation framework that separates provided-graph use from fixed-reference recovery, allowing no-KG, top-k, random, shuffled, and full-KG outputs to be scored against the same full graph.

• We provide evidence for a redundancy-aware Compressive Knowledge Graph Hypothesis: compact subgraphs often recover much of the full-KG behavior, including under an outcome-held-out control, but compression is not unique to a single semantic ranking rule.

Figure 1: Overview of the KG-guided generation pipeline. Battery-science fields are converted into a directed KG, verbalized as triples, and provided to language models under different graph conditions. Outputs are evaluated for entity recall, relation fidelity, graph coverage, and semantic distance.

2 Related Work

Prior work has studied many ways of combining knowledge graphs with language models, including using KGs as external memory, structured prompts, reasoning scaffolds, or sources of factual grounding. Early knowledge-enhanced language-model work incorporated KG entities, triples, or verbalized KG facts into pretraining and representation learning (Zhang et al., 2019; Peters et al., 2019; Agarwal et al., 2020). Surveys on knowledge-enhanced language models and LLM-KG integration argue that KGs can improve factuality, interpretability, and structured reasoning, while LLMs can help construct, complete, and verbalize KGs (Hu et al., 2022; Pan et al., 2023). Recent GraphRAG work extends this idea by using graph-based indexing, graph-guided retrieval, and graph-enhanced generation to provide relational context for downstream tasks (Peng et al., 2024; Han et al., 2024). Our work instead asks which graph facts are actually used during scientific hypothesis generation.

Previous work has also developed methods that use KGs to guide hypothesis generation. KG-CoI integrates external structured knowledge into a chain-of-ideas process and uses KG support to reduce hallucinations in hypothesis generation (Xiong et al., 2024). Related systems use scientific KGs for link prediction, literature-based discovery, or candidate hypothesis ranking (Spangler et al., 2014; Pu et al., 2023; Borrego et al., 2025; Kastrin et al., 2025). These approaches generally treat the KG as useful context or evidence. By contrast, we test the compressive view of KG utility: whether a small subset of high-value triples is sufficient to recover much of the full-KG behavior, whether removing that subset disrupts generation, and whether triple importance is governed by semantic role rather than graph topology alone.

3 Task and Experimental Setup

We study KG-guided hypothesis generation in battery science (Chen et al., 2026). Each example contains a scientific problem statement and structured fields such as material system, component, failure mode, intervention, mechanism, target property, and claimed outcome. In total we used 100 problems. (Footnote 1: The 100-problem evaluation dataset is available at https://huggingface.co/datasets/matter2mech/battery-science-problems .) From these fields, we construct a directed KG whose typed triples connect the problem to relevant scientific concepts. The model is asked to generate a solution hypothesis either without graph context or with a verbalized set of subject-relation-object triples, a common strategy for injecting KG facts into language-model prompts (Liu et al., 2019; Agarwal et al., 2020; Baek et al., 2024).

All KG variants are derived from the same local graph $G_p = (V_p, E_p)$, with 15-18 typed triples per problem. Density variants vary subset size; ontology variants vary relation granularity; topology variants range from 2-hop context to full problem-to-outcome paths. Random, shuffled, and targeted/top-k controls respectively test irrelevant triples, broken relations, and compact relevance-ranked subsets. Full definitions and size/context statistics are given in Appendices B.6 and G.

Battery materials are a useful testbed because the problems are inherently relational and mechanism-driven.
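The construction described in this section, from structured fields to verbalized triples in the prompt, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the relation names in `FIELD_RELATIONS`, the function names, the prompt wording, and the example field values are all hypothetical, and a real local graph has 15-18 triples rather than one per field.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical relation names, one per structured field. The paper's actual
# relation ontology is defined in its appendices and is not reproduced here.
FIELD_RELATIONS = {
    "material_system": "involves_material",
    "component": "has_component",
    "failure_mode": "exhibits_failure_mode",
    "intervention": "addressed_by_intervention",
    "mechanism": "acts_via_mechanism",
    "target_property": "targets_property",
    "claimed_outcome": "claims_outcome",
}


def build_triples(problem_id: str, fields: Dict[str, str]) -> List[Triple]:
    """Turn structured fields into directed, typed triples rooted at the problem."""
    triples = []
    for field, relation in FIELD_RELATIONS.items():
        value = fields.get(field)
        if value:  # skip missing or empty fields
            triples.append((problem_id, relation, value))
    return triples


def verbalize(triples: List[Triple]) -> str:
    """Render each triple as one 'subject relation object' line."""
    return "\n".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples)


def build_prompt(problem: str, triples: List[Triple]) -> str:
    """Keep the task text fixed and vary only the graph context."""
    if not triples:  # no-KG condition
        return f"Problem: {problem}\nPropose a solution hypothesis."
    return (
        f"Problem: {problem}\n"
        f"Knowledge graph facts:\n{verbalize(triples)}\n"
        "Propose a solution hypothesis."
    )


# Illustrative (made-up) field values for one problem.
example_fields = {
    "material_system": "Ni-rich NMC cathode",
    "failure_mode": "capacity fade",
    "intervention": "electrolyte additive",
}
example_triples = build_triples("P1", example_fields)
```

Calling `build_prompt(problem, [])` gives the no-KG prompt, while passing a subset of `example_triples` gives a graph-conditioned prompt with the same task text, which mirrors the fixed-prompt, varying-graph design.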
A good hypothesis must connect what material or component is involved, why it fails, how an intervention changes the mechanism, and which property or outcome should improve. In materials science, the hypothesis must preserve links among degradation, ion transport, interfacial stability, capacity retention, and materials design. These recurring structures are central to battery research and can be represented as KG triples, making the domain well suited for testing whether models use graph relations rather than only surface entities (Kumbhar et al., 2025).

We compare Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash in the main cross-family study, and run additional intra-family checks on Mistral-7B/12B/22B and Llama-3.1-8B/70B. For each problem, we keep the prompt fixed and vary only the graph condition. The KG manipulations span three axes (Figure 2): density (sparse, medium, dense), ontology richness (coarse T1 versus richer T3 multihop relations), and topology (2-hop versus full-path context). We also include control conditions: no KG, random KG, shuffled KG, targeted KG, and top-k compressed subgraphs. Detailed definitions of the random, shuffled, entity-only, and relation-skeleton controls are given in Appendix B.3.

We test three claims. First, real KG structure should affect outputs more than irrelevant or corrupted graph context. Second, if KG utility is compressed, then a small top-k subset should recover much of the full-KG behavior, while removing that subset should degrade the output. Third, graph influence should not be fully explained by simple topology alone: relation type, task relevance, and redundancy among local triples may all shape which compact subsets recover full-KG behavior.

Figure 2: KG perturbation design. We vary the external knowledge graph along three axes: density, ontology richness, and topology.
Density controls how many graph facts are supplied; ontology richness controls whether relations are coarse or semantically detailed; topology controls whether the model receives local 2-hop context or longer full-path structure.

4 Evaluation Metrics

We design the evaluation to measure not only whether a knowledge graph changes the generated hypothesis, but also how it changes it. This distinction is important because a model may copy graph entities without using their relations, preserve relation language without recalling the correct objects, or generate a hypothesis that is semantically close to the full-KG output while covering only a small fraction of the graph. We therefore use a set of output-based metrics that separate entity recall, relation use, graph coverage, and semantic sensitivity.

We distinguish two scoring views. Provided-graph metrics score an output against the graph supplied in that condition and measure whether the model used the given graph. These metrics are undefined for no-KG settings and can have different denominators for top-k and full-KG conditions. We therefore also report fixed-reference metrics, which score every output against the same full KG for that problem. Fixed-reference scoring makes no-KG, random, shuffled, top-k, and full-KG outputs directly comparable and avoids artificial zero scores for no-KG baselines.

Human expert evaluation. Automatic graph-use metrics can show whether an output reflects KG content, but they do not fully measure whether the hypothesis is scientifically useful. We therefore add a blinded domain-rater assessment. A materials-science postdoctoral researcher rated five representative examples comparing No KG, Top-8 KG, and Full KG outputs. The outputs were anonymized and condition labels were hidden during rating.
The rater scored each hypothesis on a 1-5 scale for problem alignment, mechanistic specificity, intervention specificity, scientific plausibility, and evidence faithfulness, and also gave pairwise preferences against the No-KG output. The full rating instructions are provided in Appendix A.1.

Triple Recall Rate. Triple Recall Rate (TRR) measures how much of the provided KG content appears in the final hypothesis at the object-entity level. For a graph condition with triples $\mathcal{G}=\{(s_{i},r_{i},o_{i})\}_{i=1}^{n}$ and generated hypothesis $y$, we define

$$\mathrm{TRR}(y,\mathcal{G})=\frac{1}{|\mathcal{O}_{\mathcal{G}}|}\sum_{o\in\mathcal{O}_{\mathcal{G}}}\mathbf{1}[o\in y],$$

where $\mathcal{O}_{\mathcal{G}}$ is the set of object entities in the supplied graph. TRR captures whether the model recalls the entities made available by the graph. However, TRR alone does not show whether the model used the graph structure correctly, since an output can mention the right entities while ignoring or distorting their relations.

Relation Fidelity Score. Relation Fidelity Score (RFS) measures whether the generated hypothesis preserves the semantic role of the KG relations. Each graph relation is mapped to a scientific role such as failure mode, mechanism, intervention, material component, property, or outcome. RFS then measures whether the language of the generated hypothesis expresses the same relation type. This allows us to distinguish shallow entity copying from relation-aware graph use. For example, mentioning an electrolyte additive contributes to TRR, but it contributes to RFS only if the output uses it in a role consistent with the supplied graph relation, such as an intervention that stabilizes an interface or improves ionic transport.

KG Triple Coverage. KG Triple Coverage (KTC) measures broader coverage of the graph content.
While TRR focuses on object entities, KTC measures the fraction of supplied triples whose object-side content is represented in the generated hypothesis. This metric is useful for comparing full-KG, no-KG, random-KG, and compressed-KG conditions because it directly measures how much of the graph context is reflected in the output. A high KTC score indicates that the model uses a larger portion of the graph, whereas a low KTC score indicates that the model either ignores the graph or uses only a small subset of it.

Fixed-reference graph recovery. Provided-graph metrics answer whether a model used the graph it received. To compare conditions with different graph sizes, we also compute fixed-reference variants. Let $\mathcal{G}_{p}^{\mathrm{full}}$ be the full KG for problem $p$. For any condition $c$, including no KG, random KG, shuffled KG, and top-k, we compute

$$\mathrm{TRR}_{\mathrm{ref}}(y_{p,c})=\mathrm{TRR}(y_{p,c},\mathcal{G}_{p}^{\mathrm{full}}).$$

Analogous fixed-reference versions are computed for relation fidelity and graph coverage. These metrics ask how much of the full problem graph is recovered in the output, regardless of which graph was provided to the model.

Semantic distance to full-KG behavior. For sufficiency and comprehensiveness experiments, we also measure semantic distance between an ablated output and the corresponding full-KG output. Let $y_{\mathrm{full}}$ be the hypothesis generated with the full KG and $y_{c}$ be the hypothesis generated under condition $c$, such as top-k triples or full KG with top-k triples removed. We compute

$$d_{\mathrm{sem}}(y_{c},y_{\mathrm{full}})=1-\cos\left(e(y_{c}),e(y_{\mathrm{full}})\right),$$

where $e(\cdot)$ is a sentence embedding function. Lower semantic distance means that the ablated condition better recovers the behavior induced by the full graph.
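To make the fixed-reference and semantic-distance definitions concrete, here is a small sketch under stated assumptions. The membership test $o \in y$ is approximated by case-insensitive substring matching, and `bow_embed` is a toy bag-of-words stand-in for the sentence encoder $e(\cdot)$; both choices, and all function names, are illustrative rather than the paper's implementation.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def trr(hypothesis: str, triples: List[Triple]) -> float:
    """Fraction of distinct object entities that appear in the hypothesis.

    Assumption: 'o in y' is checked by case-insensitive substring matching.
    """
    objects = {o.lower() for _, _, o in triples}
    if not objects:
        return float("nan")  # provided-graph TRR is undefined without a graph
    text = hypothesis.lower()
    return sum(o in text for o in objects) / len(objects)


def trr_ref(hypothesis: str, full_graph: List[Triple]) -> float:
    """Fixed-reference TRR: score any condition's output against the full KG."""
    return trr(hypothesis, full_graph)


def bow_embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return dict(Counter(text.lower().split()))


def semantic_distance(
    y_c: str,
    y_full: str,
    embed: Callable[[str], Dict[str, float]] = bow_embed,
) -> float:
    """d_sem = 1 - cos(e(y_c), e(y_full)) over sparse vectors."""
    a, b = embed(y_c), embed(y_full)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Because `trr_ref` always uses the full graph as the denominator, a no-KG output gets a real score instead of an undefined one, which is the point of the fixed-reference view. Swapping `bow_embed` for a real encoder only requires passing a different `embed` callable that returns a comparable sparse or dense representation.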
Semantic distance to the full-KG output is our main metric for testing whether a small subset of triples is sufficient to approximate full-KG behavior. Additional metric implementation details are provided in Appendix B.

Intra-problem versus inter-problem variation. To measure how strongly graph condition affects generation relative to the problem itself, we compare two sources of semantic variation. Intra-problem variation measures how much outputs change for the same scientific problem when the KG condition changes. Inter-problem variation measures how much outputs differ across different scientific problems. We summarize this using the variance ratio

$$\rho=\frac{\mathbb{E}_{p}\left[d(y_{p,c},y_{p,c^{\prime}})\right]}{\mathbb{E}_{p\neq p^{\prime}}\left[d(y_{p,c},y_{p^{\prime},c^{\prime}})\right]},$$

where $p$ indexes problems and $c, c^{\prime}$ index KG conditions. Smaller values of $\rho$ indicate that problem identity dominates over graph condition, while larger values indicate stronger sensitivity to graph context.

Statistical testing. We use paired permutation tests for the main condition contrasts because each problem is evaluated under multiple KG conditions. We also report bootstrap confidence intervals for key deltas, including ΔTRR(real − random), ΔRFS(real − shuffled), and fixed-reference recovery differences. Because we test multiple model-metric-condition contrasts, we report both uncorrected and Holm/BH-corrected p-values in Appendix C. We show numerical zeros as $p>0.999$. The main text emphasizes effect sizes, confidence intervals, and consistent directional patterns rather than isolated significance thresholds.

Implementation details for RFS, KTC, and the deterministic top-k triple-ranking rule are provided in Appendix B, with additional details in Appendices B.1 and B.2.
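The variance ratio and the paired permutation test can be sketched as below. This is an illustration under assumptions, not the authors' code: the expectations in $\rho$ are taken as plain means over all condition pairs, the permutation test is the common two-sided sign-flip variant on per-problem differences, and the names `variance_ratio`, `paired_permutation_test`, `n_perm`, and `seed` are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, Sequence


def variance_ratio(
    outputs: Dict[str, Dict[str, str]],
    dist: Callable[[str, str], float],
) -> float:
    """rho = mean intra-problem distance / mean inter-problem distance.

    outputs[p][c] is the hypothesis for problem p under KG condition c.
    Assumption: expectations are uniform means over all condition pairs.
    """
    # Same problem, different KG conditions.
    intra = [
        dist(a, b)
        for conds in outputs.values()
        for a, b in combinations(list(conds.values()), 2)
    ]
    # Different problems, any pair of conditions.
    inter = [
        dist(a, b)
        for p, q in combinations(list(outputs), 2)
        for a in outputs[p].values()
        for b in outputs[q].values()
    ]
    return mean(intra) / mean(inter)


def paired_permutation_test(
    x: Sequence[float],
    y: Sequence[float],
    n_perm: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided sign-flip permutation test on paired per-problem scores."""
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)  # fixed seed so the p-value is reproducible
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Here `x` and `y` would hold one score per problem under two KG conditions (for example real and random), so the pairing by problem is preserved. `variance_ratio` raises `ZeroDivisionError` if all inter-problem distances are zero, which a production version would want to guard against.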
We compute semantic distance with a Sentence-Transformers encoder and verify (Appendix H) that the sufficiency trend is robust to replacing MiniLM-L6 with MPNet-base. The two encoders give highly correlated distances (Spearman ρ = 0.965) and preserve the same monotonic decrease from k = 1 to k = 8.

Table 1: Model-level summary of KG utility. ΔTRR(real − random) measures sensitivity to replacing the real KG with a random KG; ΔRFS(real − shuffled) measures sensitivity to shuffling KG structure; ΔKTC(real − noKG) measures the gain from using the real KG over no KG context. The variance ratio is intra-problem semantic variation across KG conditions divided by inter-problem variation across scientific problems; lower values indicate that problem identity dominates more strongly over KG condition.

| Model | ΔTRR | ΔRFS | ΔKTC | Variance ratio | Best density | Best ontology | Best topology |
|---|---|---|---|---|---|---|---|
| Gemini | 0.2900 | 0.1867 | 0.7569 | 0.3125 | sparse | T3_multihop | full_path |
| Llama-3.1-70B | 0.0380 | 0.0100 | 0.0621 | 0.3838 | sparse | T1_coarse | 2hop |
| Mistral-7B | 0.0080 | -0.0800 | 0.0054 | 0.5148 | dense | T3_multihop | 2hop |

5 Results

Our results support the redundancy-aware Compressive Knowledge Graph Hypothesis. Across all our experiments, we observe three main patterns. First, KG utility is model-dependent. Second, compact subgraphs often approximate full-KG behavior, including when claimed-outcome triples are held out. Third, graph influence is not explained by a single selector: relation role matters, but random and topology-based subsets can also recover much of the signal when enough triples are retained.

5.1 Cross-family KG utility is selective

Table 1 summarizes the main cross-family results. Gemini shows the strongest response to real KG structure, with ΔTRR(real − random) = 0.290 and ΔKTC(real − noKG) = 0.757. Llama-3.1-70B shows a smaller but positive effect, especially in graph coverage.
Mistral-7B shows weak and brittle graph use, with almost no gain over random KG context and negative ΔRFS under the real-versus-shuffled comparison.

The preferred graph structure also differs by model. Gemini 2.5 Flash benefits most from sparse, semantically rich, full-path graph context. Llama-3.1-70B benefits most from sparse, coarse, shorter-range context. Mistral-7B prefers denser scaffolding. This supports the selective-utility view: stronger models do not necessarily need more graph context, but they can benefit from a smaller set of high-signal facts. Full permutation tests and bootstrap confidence intervals are reported in Appendix C.

We observe this in both the control comparisons and the structural ablations. In Table 9, Gemini 2.5 Flash shows large and statistically stable gains when the real KG is compared against a random KG, especially for TRR and KTC. It also shows a strong gain in RFS when the real KG is compared against a shuffled KG, indicating that relation structure is being used in a meaningful way. Llama-3.1-70B shows a smaller but still positive pattern, with the clearest gain appearing in KTC under the real-versus-random comparison. By contrast, Mistral-7B shows little benefit relative to the random KG and is strongly degraded by shuffled structure, suggesting that its graph use is weak and brittle rather than robust. A mixed-effects variance decomposition shows that model identity explains more variance than KG condition across TRR, RFS, and KTC; we report the full table in Appendix C.

These results show that external KG utility does not vanish across models, but it becomes more selective. Stronger models do not benefit uniformly from larger or richer graphs. Instead, they appear to benefit most from smaller, higher-signal graph contexts.

Materials-science prompting guidance. The best graph condition differs by model, but the top examples share a common materials-science structure: the graph preserves a chain from failure mode to intervention, mechanism, target property, and outcome. Gemini 2.5 Flash performs best with sparse, semantically rich full-path graphs, suggesting that it can use compact mechanism-to-outcome chains. Llama-3.1-70B performs best with sparse coarse 2-hop graphs, suggesting that concise local grounding is sufficient. Mistral-7B benefits more from dense rich 2-hop graphs, suggesting that smaller open models may need more explicit local scaffolding. Representative top graph examples for each model are shown in Appendix A.3.

5.2 Compressed subgraphs recover full-KG behavior

We next test whether the full graph is necessary. The necessity analysis compares no KG, entity-only context, relation skeletons, targeted KG, and full KG; full results are reported in Appendix B.4. Entity-only context does not recover the full-KG effect, showing that graph utility is not just lexical exposure to scientific terms. Relation skeletons preserve some relation-level signal but lose entity-specific grounding. Targeted KG recovers much of the full-KG behavior, indicating that useful graph signal can be preserved by compact subsets rather than requiring the entire graph. Appendix D.1 further shows that this compression effect is not unique to our semantic ranking heuristic: random and topology-based top-k selectors also approach full-KG behavior as k increases, although no single selector dominates across all models and k values.

The top-k sufficiency and comprehensiveness analyses test this directly. In the sufficiency setting, we keep only the top-k ranked triples and measure distance to the full-KG output. In the comprehensiveness setting, we remove the same top-k triples and measure degradation.
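The two ablation settings just described can be sketched as follows. This is a hedged illustration: it assumes a simple token-overlap score between each triple and the problem statement as a stand-in for the deterministic ranking rule detailed in Appendix B, and the function names and example triples are hypothetical.

```python
import re
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; underscores and hyphens act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_triples(problem: str, triples: List[Triple]) -> List[Triple]:
    """Order triples by token overlap with the problem statement.

    Assumption: overlap is counted over the whole verbalized triple.
    sorted() is stable, so ties keep their input order and the ranking
    is deterministic.
    """
    problem_tokens = _tokens(problem)

    def score(triple: Triple) -> int:
        return len(_tokens(" ".join(triple)) & problem_tokens)

    return sorted(triples, key=score, reverse=True)


def sufficiency_subset(problem: str, triples: List[Triple], k: int) -> List[Triple]:
    """Sufficiency condition: keep only the top-k ranked triples."""
    return rank_triples(problem, triples)[:k]


def comprehensiveness_subset(problem: str, triples: List[Triple], k: int) -> List[Triple]:
    """Comprehensiveness condition: the full graph with the top-k triples removed."""
    top = set(rank_triples(problem, triples)[:k])
    return [t for t in triples if t not in top]
```

Each subset would then be verbalized and passed to the model, and the resulting hypothesis compared with the full-KG hypothesis by semantic distance. Random or topology-based selectors can be compared against this one by replacing `rank_triples` while keeping the two subset functions unchanged.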
Figure 3 shows the key pattern: as k increases, top-k triples become increasingly sufficient to recover full-KG behavior, while removing them causes larger disruption. This is the central empirical signature of the Compressive Knowledge Graph Hypothesis.

Because our top-k ranking uses lexical overlap with the problem statement, we also test whether compression is merely an artifact of this heuristic. We compare semantic top-k subsets against matched random-k and topology-based subsets selected by degree, betweenness, and PageRank. The compression trend remains across ranking methods, so our main claim is not that the lexical ranking is optimal, but that full-KG behavior can be approximated by compact subsets of triples. Full results are reported in Appendix D.1.

The expert scores in Table 3 support the compression pattern. No-KG outputs are often plausible, but they are less mechanistically specific and less faithful to the supplied evidence. Top-8 KG nearly matches Full KG on mechanistic specificity and evidence faithfulness, while keeping scientific plausibility high. Because the rater saw anonymized outputs without condition labels, this provides a small sanity check that compact graph context recovers useful scientific grounding rather than merely increasing surface overlap. Full rating instructions are given in Appendix A.1.

On average, each per-problem KG contains 16.1 triples (median 16, range 15-18), so the top-8 subset corresponds to only 49.7% of the full local graph and top-4 to 24.8%. Full KG-size and context-length statistics are reported in Appendix G. This shows that the observed recovery is not due to using nearly the whole graph, but to a compact subset of graph facts.

Figure 3: Sufficiency and comprehensiveness support KG compression. Keeping only the top-k ranked triples increasingly recovers the full-KG output, while removing the same triples causes systematic degradation. Compact subsets are often sufficient to approximate full-KG behavior.

Outcome-held-out control. Because the local KG includes claimed-outcome triples, we test whether compression is merely an artifact of exposing part of the target hypothesis. We remove all outcome-facing triples and rerun full-KG and top-k conditions for Gemini and Llama-3.1-70B. Table 2 shows that removing outcome triples reduces relation fidelity, as expected, but does not eliminate graph signal. For Gemini, full-KG RFS drops from 0.580 to 0.439, while top-8 without outcome triples still reaches 0.458. For Llama-3.1-70B, full-KG RFS drops from 0.677 to 0.511, while top-8 without outcome triples reaches 0.550. Mechanism/intervention coverage is also preserved under top-8 no-outcome conditions. Thus, outcome-facing triples are high-leverage, but the compression effect is not only outcome leakage.

Table 2: Outcome-held-out control. Removing claimed-outcome triples reduces relation fidelity but does not eliminate graph signal. Compact top-8 subgraphs without outcome triples preserve substantial relation and mechanism/intervention signal.

| Model | Condition | RFS | Mech./Int. coverage |
|---|---|---|---|
| Gemini | Full KG | 0.580 | 0.075 |
| Gemini | Full KG, outcome triples removed | 0.439 | 0.075 |
| Gemini | Top-8, outcome triples removed | 0.458 | 0.073 |
| Llama-70B | Full KG | 0.677 | 0.107 |
| Llama-70B | Full KG, outcome triples removed | 0.511 | 0.093 |
| Llama-70B | Top-8, outcome triples removed | 0.550 | 0.110 |

Table 3: Human expert evaluation. A materials-science postdoc blindly scored five representative examples on a 1-5 scale. Top-8 KG closely tracks Full KG on mechanism and evidence grounding, while both improve over No KG.

| Criterion | No KG | Top-8 KG | Full KG |
|---|---|---|---|
| Problem alignment | 3.2 | 4.0 | 4.2 |
| Mechanistic specificity | 2.4 | 3.8 | 4.0 |
| Intervention specificity | 2.6 | 3.6 | 3.8 |
| Scientific plausibility | 3.8 | 3.8 | 4.0 |
| Evidence faithfulness | 2.0 | 3.8 | 4.2 |
| Pairwise vs No KG | - | 4/5 | 5/5 |
| Top-8 close to Full | - | 4/5 | - |

5.3 Topology alone does not explain graph influence

Compression alone does not identify which triples matter. We
