Large Language Models (LLMs) are increasingly used to synthesize complex scientific information, yet we lack a clear understanding of how well these models perform when tasks require combining knowledge across different scientific fields. Current benchmarks often rely on simple, single-turn questions that do not reflect the multi-step, interactive nature of real-world scientific research. To bridge this gap, the authors introduce XDomainBench, a new diagnostic benchmark designed to stress-test how LLMs handle interdisciplinary reasoning.
Evaluating Scientific Reasoning
XDomainBench moves beyond static testing by simulating interactive scientific workflows. It covers 20 scientific domains and four task categories, totaling 8,598 interactive sessions. The researchers formalize two properties of each task: its "composition order" (the complexity of combining different fields) and its "mixture structure." By incorporating eight realistic trajectory patterns, the benchmark mimics the dynamics of AI for Science (AI4S) scenarios, allowing a systematic evaluation of how model performance changes as tasks move from single-discipline problems to complex, interdisciplinary ones.
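To make the idea of evaluating by composition order concrete, here is a minimal Python sketch. It is not the benchmark's actual schema or scoring code: the `Session` record, its field names, the reading of composition order as the number of distinct domains in a session, and the all-turns-correct success rule are all assumptions made for illustration, and the data is made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical record for one interactive session (not the benchmark's schema)."""
    domains: tuple[str, ...]        # scientific fields the session draws on
    turn_correct: tuple[bool, ...]  # per-turn grading, in turn order

    @property
    def composition_order(self) -> int:
        # Assumption: order = number of distinct domains being combined.
        return len(set(self.domains))

    @property
    def solved(self) -> bool:
        # Assumption: a session counts as solved only if every turn is correct.
        return all(self.turn_correct)


def accuracy_by_order(sessions):
    """Return {composition_order: fraction of solved sessions}."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for s in sessions:
        totals[s.composition_order] += 1
        solved[s.composition_order] += int(s.solved)
    return {order: solved[order] / totals[order] for order in sorted(totals)}


# Made-up toy data, for illustration only.
toy_sessions = [
    Session(("physics",), (True, True)),
    Session(("physics",), (True, True, True)),
    Session(("physics", "chemistry"), (True, False)),
    Session(("physics", "chemistry"), (True, True)),
    Session(("biology", "chemistry", "physics"), (True, False, False)),
]
result = accuracy_by_order(toy_sessions)
print(result)
```

Grouping results this way is what lets a drop in accuracy be attributed to rising composition order rather than to the overall mix of tasks.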
The Phenomenon of Reasoning Collapse
Through large-scale evaluation, the researchers identified a consistent "reasoning collapse" in LLMs as the complexity of the composition order increases. This means that as models are asked to synthesize knowledge across more domains or more complex structures, their ability to reason effectively breaks down. The study identifies two primary drivers for this failure:
Direct Difficulty: The inherent challenge of the task increases as more domains are composed, making it harder for the model to maintain accuracy.
Interaction-Amplified Failures: In multi-step, interactive sessions, small errors tend to accumulate. As they compound, they trigger a chain reaction in which reasoning breaks down and the model experiences "domain confusion," ultimately leading to the collapse of the entire session.
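The compounding effect behind interaction-amplified failures can be illustrated with a toy calculation. This is a simplified model and not the paper's analysis: it assumes each turn succeeds independently with the same probability and that a single wrong turn sinks the session. The function name `session_success` and the numbers used are hypothetical.

```python
def session_success(per_turn_accuracy: float, turns: int) -> float:
    """Probability that every turn in a session is correct.

    Toy assumption: turns succeed independently with the same probability,
    and one wrong turn is enough to derail the whole session.
    """
    if not 0.0 <= per_turn_accuracy <= 1.0:
        raise ValueError("per_turn_accuracy must be between 0 and 1")
    if turns < 0:
        raise ValueError("turns must be non-negative")
    return per_turn_accuracy ** turns


# Illustrative numbers only: the same per-turn accuracy, longer sessions.
for n in (1, 5, 10):
    print(n, round(session_success(0.9, n), 4))
```

Even under these mild assumptions, a model that is right 90% of the time per turn completes a 10-turn session only about 35% of the time. If errors also make later turns harder, as the domain-confusion chain reaction suggests, the real decline would be steeper than this independent-turn estimate.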
Implications for AI in Science
The findings from XDomainBench highlight a critical boundary in current LLM capabilities. While these models are powerful tools for knowledge synthesis, their performance is not robust when faced with the high-dimensional, multi-disciplinary requirements of real-world scientific inquiry. By formalizing these failure modes, the benchmark provides a clearer path for diagnosing why models fail in complex scientific workflows and underscores the need for more resilient reasoning architectures in AI4S applications.