SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clari...

Large Language Models (LLMs) are increasingly used as scientific assistants, but most existing benchmarks assume that the scientific problem is already fully specified. In practice, researchers often start with "ill-posed" requests: tasks that are missing critical information or that contain internal contradictions. If an AI assistant does not clarify these issues before beginning a simulation or analysis, it may produce results that are physically invalid or irrelevant to the user's actual goal. This paper introduces SCICONVBENCH, a benchmark designed to evaluate how well LLMs handle these "upstream" conversational tasks across four fields: fluid mechanics, solid mechanics, materials science, and partial differential equations.

Identifying and Resolving Scientific Ambiguity

SCICONVBENCH tests two primary capabilities: disambiguation (eliciting missing information) and inconsistency resolution (detecting and correcting contradictory requirements). The benchmark uses a structured task ontology that defines the essential components of a scientific study, such as boundary conditions, material properties, and numerical constraints. By simulating a multi-turn dialogue between an AI assistant and a user, the benchmark measures whether the model can successfully identify these gaps and resolve them through conversation before finalizing a scientific specification.
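
To make the two capabilities concrete, here is a minimal sketch of how an ontology-backed specification could expose both kinds of problem. It is not the benchmark's actual schema: the slot names, the `TaskSpec` class, the extra `physical_model` slot, and the single consistency rule (the common rule of thumb that incompressible-flow assumptions break down above roughly Mach 0.3) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical slot names; the first three echo the components named in the
# text, "physical_model" is an extra slot assumed for this example.
REQUIRED_SLOTS = (
    "boundary_conditions",
    "material_properties",
    "numerical_constraints",
)


@dataclass
class TaskSpec:
    boundary_conditions: Optional[dict] = None
    material_properties: Optional[dict] = None
    numerical_constraints: Optional[dict] = None
    physical_model: Optional[dict] = None


def missing_slots(spec: TaskSpec) -> List[str]:
    """Disambiguation targets: required slots that are unset or empty."""
    return [name for name in REQUIRED_SLOTS if not getattr(spec, name)]


def incompressible_but_fast(spec: TaskSpec) -> Optional[str]:
    """Illustrative rule: flag an incompressible model at Mach > 0.3."""
    model = spec.physical_model or {}
    if model.get("incompressible") and model.get("mach", 0.0) > 0.3:
        return "incompressible flow requested at Mach > 0.3"
    return None


Rule = Callable[[TaskSpec], Optional[str]]


def find_inconsistencies(spec: TaskSpec, rules: List[Rule]) -> List[str]:
    """Inconsistency targets: messages from every rule that fires."""
    return [msg for rule in rules if (msg := rule(spec)) is not None]


spec = TaskSpec(
    boundary_conditions={"inlet": "velocity"},
    physical_model={"incompressible": True, "mach": 0.8},
)
print(missing_slots(spec))
print(find_inconsistencies(spec, [incompressible_but_fast]))
```

In this framing, each missing slot is something the assistant should ask about, and each fired rule is a contradiction it should raise with the user rather than repair on its own.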

A Rigorous Evaluation Framework

To ensure accurate assessment, the authors developed a rubric-based evaluation framework that goes beyond simple end-state success. It distinguishes between "conversation-grounded" resolution, where the model explicitly asks the user for the missing information, and "silent" resolution, where the model makes an unverified assumption or performs an implicit repair. This distinction is critical because silent assumptions can lead to irreproducible or incorrect scientific outcomes. The benchmark tracks metrics such as the Final Resolution Rate (FRR) and the Conversation-Grounded Resolution Rate (CGRR) to determine whether models are truly collaborating with the user or simply guessing.
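
The relationship between the two metrics can be sketched as follows. The exact definitions are not given here, so this assumes a simple reading: FRR is the fraction of cases whose final specification resolves the issue, and CGRR is the fraction resolved with support from an explicit exchange with the user. The `CaseResult` record and its two flags are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CaseResult:
    resolved: bool  # final specification handles the issue correctly
    grounded: bool  # the resolution is backed by an explicit user exchange


def resolution_rates(results: List[CaseResult]) -> Tuple[float, float]:
    """Return (FRR, CGRR) under the assumed definitions above."""
    if not results:
        raise ValueError("need at least one case")
    n = len(results)
    frr = sum(r.resolved for r in results) / n
    # A case counts toward CGRR only if it is both resolved and grounded.
    cgrr = sum(r.resolved and r.grounded for r in results) / n
    return frr, cgrr


results = [
    CaseResult(resolved=True, grounded=True),
    CaseResult(resolved=True, grounded=False),   # silent resolution
    CaseResult(resolved=False, grounded=False),
    CaseResult(resolved=True, grounded=True),
]
frr, cgrr = resolution_rates(results)
print(f"FRR={frr:.2f} CGRR={cgrr:.2f} silent gap={frr - cgrr:.2f}")
# prints "FRR=0.75 CGRR=0.50 silent gap=0.25"
```

Under this reading CGRR can never exceed FRR, and the difference between them is the share of cases the model resolved silently.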

Key Findings on Model Performance

The study reveals that while current frontier models are relatively capable of resolving internal inconsistencies, they struggle significantly with eliciting missing information. For example, even the best-performing model resolved only 52.7% of disambiguation cases in fluid mechanics. Furthermore, the research highlights a persistent gap between a model's ability to produce a "correct" final specification and its ability to ground that specification in the conversation. Models frequently make silent assumptions that are not supported by the user dialogue, indicating that current AI assistants are not yet fully reliable for the upstream formulation of complex scientific tasks.

Implications for Scientific AI

The results indicate that computational science presents a much harder challenge for conversational AI than general-purpose tasks. When tested on a general-domain clarification dataset, models performed well, but their success rates dropped sharply on the specific, high-stakes requirements of SCICONVBENCH. This suggests that general-purpose conversational skills do not automatically transfer to scientific domains. By establishing this benchmark, the authors provide a foundation for developing more robust AI assistants that prioritize accurate, grounded, and transparent task formulation before any computational work begins.
