XDomainBench: Diagnosing Reasoning Collapse in High...

Key Takeaways

  • Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized.
  • Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows.
  • To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning.
Paper Abstract

Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.

Large Language Models (LLMs) are increasingly used to synthesize complex scientific information, yet we lack a clear understanding of how well these models perform when tasks require combining knowledge across different scientific fields. Current benchmarks often rely on simple, single-turn questions that do not reflect the multi-step, interactive nature of real-world scientific research. To bridge this gap, the authors introduce XDomainBench, a new diagnostic benchmark designed to stress-test how LLMs handle interdisciplinary reasoning.

Evaluating Scientific Reasoning

XDomainBench moves beyond static testing by simulating interactive scientific workflows. It comprises 8,598 interactive sessions spanning 20 scientific domains and four task categories. The researchers formalize the "composition order" (the complexity of combining different fields) and the "mixture structure" of these tasks. By incorporating eight realistic trajectory patterns that cover difficulty and domain-mixture dynamics, the benchmark mimics AI for Science (AI4S) scenarios and allows a systematic evaluation of how models perform as they move from single-discipline problems to interdisciplinary ones.
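
To make the vocabulary concrete, here is a minimal sketch of how one interactive session might be represented. The schema is hypothetical: the class names, the field names, and the choice to approximate composition order by the number of distinct domains in a session are all illustrative assumptions, not the benchmark's actual data format or formal definition.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One step of an interactive session (hypothetical schema)."""
    prompt: str
    domains: frozenset[str]  # disciplines this turn draws on


@dataclass
class Session:
    """A multi-turn session; field names are illustrative, not the benchmark's."""
    task_category: str
    turns: list[Turn] = field(default_factory=list)

    def composition_order(self) -> int:
        # Assumption: order = number of distinct domains combined in the session.
        return len(set().union(*(t.domains for t in self.turns)))

    def domain_trajectory(self) -> list[int]:
        # Per-turn domain count, a crude view of how the domain mixture evolves.
        return [len(t.domains) for t in self.turns]


session = Session(
    task_category="example",
    turns=[
        Turn("Estimate the reaction rate.", frozenset({"chemistry"})),
        Turn("Relate it to membrane transport.", frozenset({"chemistry", "biology"})),
        Turn("Model the diffusion numerically.", frozenset({"physics", "mathematics"})),
    ],
)
print(session.composition_order())  # 4
print(session.domain_trajectory())  # [1, 2, 2]
```

Under this reading, a single-discipline session has order 1, and the stress test amounts to tracking model accuracy as that number grows.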

The Phenomenon of Reasoning Collapse

Through large-scale evaluation, the researchers identified a systematic "reasoning collapse" in LLMs as composition order increases: when models are asked to synthesize knowledge across more domains or more complex mixture structures, their ability to reason effectively breaks down. The study identifies two root causes:

  • Direct Difficulty: The inherent challenge of the task increases as more domains are composed, making it harder for the model to maintain accuracy.

  • Interaction-Amplified Failures: In multi-turn interactive sessions, certain trajectory patterns trigger error accumulation, reasoning breaks, and "domain confusion," which ultimately cause the entire session to collapse.
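
A toy calculation shows why multi-turn sessions are fragile even before any amplification. The sketch below is not the paper's model. It assumes every turn succeeds independently with the same probability and that a session counts as successful only if all turns are correct. The function name and the accuracy values are hypothetical.

```python
def session_success_probability(per_turn_accuracy: float, num_turns: int) -> float:
    """Probability that every turn is correct, assuming independent turns."""
    if not 0.0 <= per_turn_accuracy <= 1.0:
        raise ValueError("per_turn_accuracy must lie in [0, 1]")
    if num_turns < 0:
        raise ValueError("num_turns must be non-negative")
    # Independence assumption: the per-turn probabilities simply multiply.
    return per_turn_accuracy ** num_turns


for accuracy in (0.95, 0.90):
    for turns in (1, 5, 10):
        p = session_success_probability(accuracy, turns)
        print(f"accuracy={accuracy:.2f} turns={turns:2d} session_success={p:.3f}")
# accuracy=0.95 turns= 1 session_success=0.950
# accuracy=0.95 turns= 5 session_success=0.774
# accuracy=0.95 turns=10 session_success=0.599
# accuracy=0.90 turns= 1 session_success=0.900
# accuracy=0.90 turns= 5 session_success=0.590
# accuracy=0.90 turns=10 session_success=0.349
```

Both drivers can be read off this baseline. Direct difficulty corresponds to a lower per-turn accuracy. Interaction-amplified failures break the independence assumption, since an early mistake makes later turns more likely to fail, so real sessions could degrade faster than these numbers suggest.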

Implications for AI in Science

The findings from XDomainBench highlight a critical boundary in current LLM capabilities. While these models are powerful tools for knowledge synthesis, their performance is not robust when faced with the high-dimensional, multi-disciplinary requirements of real-world scientific inquiry. By formalizing these failure modes, the benchmark provides a clearer path for diagnosing why models fail in complex scientific workflows and underscores the need for more resilient reasoning architectures in AI4S applications.
