SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clari...

Key Takeaways

  • Large Language Models (LLMs) are increasingly used as scientific assistants, but most existing benchmarks assume that a scientific problem is already perfect...
  • Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use.
  • Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics.
  • We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users.
  • SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires.
Paper Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at this https URL.

Large Language Models (LLMs) are increasingly used as scientific assistants, but most existing benchmarks assume that a scientific problem is already perfectly defined. In practice, researchers often start with "ill-posed" requests—tasks that are missing critical information or contain internal contradictions. If an AI assistant does not clarify these issues before beginning a simulation or analysis, it may produce results that are physically invalid or irrelevant to the user's actual goal. This paper introduces SCICONVBENCH, a new benchmark designed to evaluate how well LLMs can navigate these "upstream" conversational tasks across four fields: fluid mechanics, solid mechanics, materials science, and partial differential equations.

Identifying and Resolving Scientific Ambiguity

SCICONVBENCH tests two primary capabilities: disambiguation (eliciting missing information) and inconsistency resolution (detecting and correcting contradictory requirements). The benchmark uses a structured task ontology that defines the essential components of a scientific study, such as boundary conditions, material properties, and numerical constraints. By simulating a multi-turn dialogue between an AI assistant and a user, the benchmark measures whether the model can successfully identify these gaps and resolve them through conversation before finalizing a scientific specification.
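To make the idea of a structured task ontology concrete, here is a minimal Python sketch of how a specification with required components might be checked for gaps. It is not the benchmark's actual schema: the `TaskSpec` class, the `missing_slots` helper, and slot names such as `inlet_velocity` are hypothetical, chosen only to mirror the components named above (boundary conditions, material properties, numerical constraints).

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical task specification; field names are illustrative only."""
    boundary_conditions: dict = field(default_factory=dict)
    material_properties: dict = field(default_factory=dict)
    numerical_constraints: dict = field(default_factory=dict)


def missing_slots(spec: TaskSpec, required: dict[str, list[str]]) -> list[str]:
    """Return 'component.slot' names that are required but absent from the spec."""
    gaps = []
    for component, slots in required.items():
        values = getattr(spec, component)
        for slot in slots:
            if slot not in values:
                gaps.append(f"{component}.{slot}")
    return gaps


# Assumed requirements for an illustrative fluid mechanics request.
required = {
    "boundary_conditions": ["inlet_velocity", "outlet_pressure"],
    "material_properties": ["viscosity", "density"],
}

# An under-specified request: outlet pressure and viscosity were never given.
spec = TaskSpec(
    boundary_conditions={"inlet_velocity": 1.0},
    material_properties={"density": 1000.0},
)

print(missing_slots(spec, required))
```

In a disambiguation case, each entry returned by a check like this would correspond to a question the assistant should ask the user rather than fill in on its own.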

A Rigorous Evaluation Framework

To ensure accurate assessment, the researchers developed a rubric-based evaluation framework that goes beyond simple end-state success. It distinguishes between "conversation-grounded" resolution—where the model explicitly asks the user for the missing information—and "silent" resolution, where the model makes an unverified assumption or performs an implicit repair. This distinction is critical because silent assumptions can lead to irreproducible or incorrect scientific outcomes. The benchmark tracks metrics like the Final Resolution Rate (FRR) and the Conversation-Grounded Resolution Rate (CGRR) to determine if models are truly collaborating with the user or simply guessing.
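The relationship between the two metrics can be sketched in a few lines of Python. The definitions used here are an assumption based on the description above, not the paper's exact rubric: FRR is taken as the fraction of cases whose final specification resolves the seeded issue, and CGRR as the fraction that are resolved and also traceable to the user dialogue. The `CaseResult` record and the `resolution_rates` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    resolved: bool   # final specification fixes the seeded issue
    grounded: bool   # the fix is traceable to a user turn in the dialogue


def resolution_rates(results: list[CaseResult]) -> tuple[float, float]:
    """Return (FRR, CGRR) as fractions of all cases (assumed definitions)."""
    if not results:
        raise ValueError("no cases to score")
    n = len(results)
    frr = sum(r.resolved for r in results) / n
    cgrr = sum(r.resolved and r.grounded for r in results) / n
    return frr, cgrr


# Made-up toy data, not results from the paper.
results = [
    CaseResult(resolved=True, grounded=True),
    CaseResult(resolved=True, grounded=False),   # silent resolution
    CaseResult(resolved=False, grounded=False),
    CaseResult(resolved=True, grounded=True),
]

frr, cgrr = resolution_rates(results)
print(f"FRR={frr:.2f} CGRR={cgrr:.2f} silent={frr - cgrr:.2f}")
```

Under these assumed definitions CGRR can never exceed FRR, and the difference between them is the share of cases resolved silently, that is, by assumption or implicit repair rather than by asking the user.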

Key Findings on Model Performance

The study reveals that while current frontier models are relatively capable of resolving internal inconsistencies, they struggle significantly with eliciting missing information. For example, even the best-performing model resolved only 52.7% of disambiguation cases in fluid mechanics. Furthermore, the research highlights a persistent gap between a model's ability to produce a "correct" final specification and its ability to ground that specification in the conversation. Models frequently make silent assumptions that are not supported by the user dialogue, indicating that current AI assistants are not yet fully reliable for the upstream formulation of complex scientific tasks.

Implications for Scientific AI

The results demonstrate that computational science presents a much harder challenge for conversational AI than general-purpose tasks. When tested on a general-domain clarification dataset, models performed well, but their success rates dropped drastically when applied to the specific, high-stakes requirements of SCICONVBENCH. This suggests that general-purpose conversational skills do not automatically translate to scientific domains. By establishing this benchmark, the authors provide a foundation for developing more robust AI assistants that prioritize accurate, grounded, and transparent task formulation before any computational work begins.
