ClinEnv: An Interactive Multi-Stage Long Horizon EH...

Paper Abstract

Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe these properties, and existing interactive medical benchmarks each compromise on at least one of them. We present ClinEnv, an interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions under a paradigm we term Longitudinal Inpatient Simulation. Each case is automatically constructed into an ordered sequence of decision stages; at every stage the model must actively query four specialized agents before committing to medications, procedures, and diagnoses. ClinEnv scores both what the model decides, through deterministic ontology-grounded matching, and how it gathers information. Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages, where models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1) and continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable.

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents introduces a new way to evaluate how well AI models perform as attending physicians. Many existing AI benchmarks test medical knowledge through multiple-choice questions, so they fail to capture the reality of hospital care, where doctors must gather information over time, handle uncertainty, and make a series of irreversible decisions. ClinEnv follows a paradigm the authors call "Longitudinal Inpatient Simulation": AI models manage real patient admissions, actively query specialized information agents, and commit to clinical actions that are then checked against the actual medical records.

How ClinEnv Works

The benchmark transforms raw patient data from the MIMIC-IV database into structured, multi-stage clinical cases. Each case is broken down into a sequence of stages. At every stage, the AI model starts with limited information and must interact with four specialized agents—representing the patient, a nurse, laboratory results, and medical history—to gather the data it needs. Once the model has gathered information, it must submit its decisions regarding medications, procedures, and diagnoses. These decisions are then scored using a dual framework: one that measures the accuracy of the clinical choices using medical ontologies, and another that evaluates the efficiency and quality of the model's information-gathering process.
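The query-then-commit loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the benchmark's actual interface: the agent identifiers, the `Decision` and `StageLog` types, the `run_stage` function, the query budget, and the placeholder codes such as `med:A` are all hypothetical, and a scripted policy stands in for the LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple, Union

# The four information sources named in the summary; identifiers are hypothetical.
AGENTS = ("patient", "nurse", "lab", "history")

Query = Tuple[str, str]  # (agent, topic)


@dataclass
class Decision:
    """What the model commits to at the end of a stage (placeholder codes)."""
    medications: Set[str] = field(default_factory=set)
    procedures: Set[str] = field(default_factory=set)
    diagnoses: Set[str] = field(default_factory=set)


@dataclass
class StageLog:
    """Record of one stage: the queries issued and the final commitment."""
    queries: List[Query]
    decision: Decision


def run_stage(
    facts: Dict[Query, str],
    policy: Callable[[List[str]], Union[Query, Decision]],
    max_queries: int = 5,  # assumed budget, not taken from the paper
) -> StageLog:
    """Let the policy query agents until it commits a Decision."""
    observations: List[str] = []
    queries: List[Query] = []
    while True:
        action = policy(observations)
        if isinstance(action, Decision):
            return StageLog(queries, action)  # the commitment ends the stage
        if len(queries) >= max_queries:
            raise RuntimeError("query budget exhausted without a decision")
        agent, _topic = action
        if agent not in AGENTS:
            raise ValueError(f"unknown agent: {agent}")
        queries.append(action)
        observations.append(facts.get(action, "no information available"))


def toy_policy(observations: List[str]) -> Union[Query, Decision]:
    """Scripted stand-in for an LLM: two queries, then a commitment."""
    plan = [("patient", "symptoms"), ("lab", "panel")]
    if len(observations) < len(plan):
        return plan[len(observations)]
    return Decision(medications={"med:A"}, diagnoses={"dx:X"})


# Made-up stage content; real cases would come from the constructed admissions.
facts = {
    ("patient", "symptoms"): "placeholder symptom report",
    ("lab", "panel"): "placeholder lab values",
}
log = run_stage(facts, toy_policy)
print(log.queries)
print(log.decision)
```

Keeping the query log next to the committed decision is what would let a scorer judge both the outcome and the process from the same record.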

Measuring Clinical Competence

Unlike static benchmarks that only look at the final answer, ClinEnv measures both the "what" and the "how" of clinical practice. By requiring models to actively seek information, the benchmark makes the "information-acquisition gap" measurable. It tracks how effectively a model uses its tools and whether it gathers necessary data or wastes resources on redundant queries. This allows researchers to see if a model is making the right decisions for the right reasons, or if it is simply guessing correctly despite a poor diagnostic process.
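One simple way to put a number on redundant querying is the fraction of queries in each stage that repeat something already asked earlier in the case. The summary does not spell out ClinEnv's actual process metrics, so the exact-match definition and the `per_stage_redundancy` function below are assumptions made for illustration.

```python
from typing import Hashable, List, Sequence, Set


def per_stage_redundancy(stages: Sequence[Sequence[Hashable]]) -> List[float]:
    """For each stage, return the share of queries already issued earlier in the case.

    A query counts as redundant if an identical query appeared before it,
    in the same stage or in any previous one (exact-match assumption).
    """
    seen: Set[Hashable] = set()
    rates: List[float] = []
    for stage_queries in stages:
        repeats = 0
        for query in stage_queries:
            if query in seen:
                repeats += 1
            seen.add(query)
        rates.append(repeats / len(stage_queries) if stage_queries else 0.0)
    return rates


# Made-up query log for a two-stage case: the lab query is repeated in stage 2.
case = [
    [("lab", "panel"), ("patient", "symptoms")],
    [("lab", "panel"), ("nurse", "vitals")],
]
print(per_stage_redundancy(case))  # [0.0, 0.5]
```

Exact matching is deliberately crude: a repeated lab query in a later stage may be legitimate if new results have arrived, so a real metric would need to account for what has changed between stages.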

Key Findings

Across the seven large language models tested, even the strongest reached only 0.31 decision F1. A clear gap emerged between diagnoses and management actions: models recovered discharge diagnoses far more reliably (0.51 F1) than the correct management actions (0.17 F1). Difficulty also concentrated in later stages, where models continued to issue redundant queries as cases progressed, suggesting that current AI models struggle to sustain efficient clinical reasoning over long horizons. The study further found that outcome quality is sharply decoupled from process quality, which outcome-only evaluation cannot detect. These results indicate that existing benchmarks may be overestimating how ready AI models are for real-world clinical environments.
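For readers who want to interpret these F1 values, the sketch below computes F1 between a predicted and a reference set of codes. It assumes both sets have already been normalized to shared ontology identifiers and that matching is exact. The `set_f1` function, the placeholder codes, and the convention of scoring two empty sets as 1.0 are illustrative assumptions, and how ClinEnv aggregates scores across stages and decision types is not stated in this summary.

```python
from typing import Set


def set_f1(predicted: Set[str], reference: Set[str]) -> float:
    """F1 between two sets of codes, using exact matching."""
    if not predicted and not reference:
        return 1.0  # assumed convention: nothing to predict, nothing predicted
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Placeholder codes: 2 of 3 predictions are correct, 2 of 4 references are found.
predicted = {"A", "B", "C"}
reference = {"B", "C", "D", "E"}
print(round(set_f1(predicted, reference), 4))  # 0.5714
```

Here precision is 2/3 and recall is 1/2, giving an F1 of 4/7. A score near 0.17 therefore means that most committed actions are either spurious or missing relative to the record.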
