"ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents" introduces a new way to evaluate how well AI models perform as attending physicians. Many existing AI benchmarks test medical knowledge through multiple-choice questions, but they fail to capture the reality of hospital care, where doctors must gather information over time, handle uncertainty, and make a series of irreversible decisions. ClinEnv frames this setting as a "Longitudinal Inpatient Simulation": AI models must manage real patient admissions, actively query specialized information agents, and commit to clinical actions that are then verified against actual medical records.
How ClinEnv Works
The benchmark transforms raw patient data from the MIMIC-IV database into structured, multi-stage clinical cases. Each case is broken down into a sequence of stages. At every stage, the AI model starts with limited information and must interact with four specialized agents—representing the patient, a nurse, laboratory results, and medical history—to gather the data it needs. Once the model has gathered information, it must submit its decisions regarding medications, procedures, and diagnoses. These decisions are then scored using a dual framework: one that measures the accuracy of the clinical choices using medical ontologies, and another that evaluates the efficiency and quality of the model's information-gathering process.
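The stage structure described above can be sketched in a few lines of Python. This is a minimal illustration, not ClinEnv's actual interface: the names `Stage`, `Decisions`, `ask`, and `run_stage`, the agent labels, and the toy lab value are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Hypothetical labels for the four information agents named in the summary.
AGENTS = ("patient", "nurse", "lab", "history")

Query = Tuple[str, str]  # (agent, question)


@dataclass
class Stage:
    briefing: str              # the limited information shown at stage start
    answers: Dict[Query, str]  # what each agent would reply, keyed by query


@dataclass
class Decisions:
    medications: Set[str] = field(default_factory=set)
    procedures: Set[str] = field(default_factory=set)
    diagnoses: Set[str] = field(default_factory=set)


def ask(stage: Stage, agent: str, question: str) -> str:
    """Route one question to one agent and return its reply."""
    if agent not in AGENTS:
        raise ValueError(f"unknown agent: {agent}")
    return stage.answers.get((agent, question), "No information available.")


def run_stage(stage: Stage, queries: List[Query], decisions: Decisions):
    """Gather information first, then commit the decisions for this stage."""
    transcript = [(agent, q, ask(stage, agent, q)) for agent, q in queries]
    return transcript, decisions


# Toy stage with made-up content, for illustration only.
stage = Stage(
    briefing="Adult admitted with weakness.",
    answers={("lab", "potassium"): "K 5.9 mmol/L"},
)
```

In this sketch the transcript feeds a process score and the returned `Decisions` feeds an accuracy score, mirroring the dual framework described above.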
Measuring Clinical Competence
Unlike static benchmarks that only look at the final answer, ClinEnv measures both the "what" and the "how" of clinical practice. By requiring models to actively seek information, the benchmark makes the "information-acquisition gap" measurable. It tracks how effectively a model uses its tools and whether it gathers necessary data or wastes resources on redundant queries. This allows researchers to see if a model is making the right decisions for the right reasons, or if it is simply guessing correctly despite a poor diagnostic process.
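One way to make the information-gathering side concrete is to count repeated queries and check how much of the needed data was actually requested. The sketch below is an assumption about how such a measure could look; the function `query_efficiency` and its two outputs are hypothetical and are not the metrics defined in the paper.

```python
from typing import Dict, List, Set, Tuple

Query = Tuple[str, str]  # (agent, question)


def query_efficiency(queries: List[Query], needed: Set[Query]) -> Dict[str, float]:
    """Summarize one stage's queries.

    redundancy_rate: share of queries that exactly repeat an earlier one.
    coverage: share of the needed queries that were asked at least once.
    """
    seen: Set[Query] = set()
    redundant = 0
    for q in queries:
        if q in seen:
            redundant += 1
        seen.add(q)
    redundancy_rate = redundant / len(queries) if queries else 0.0
    coverage = len(seen & needed) / len(needed) if needed else 1.0
    return {"redundancy_rate": redundancy_rate, "coverage": coverage}
```

A model can score well on decisions while showing low coverage here, which is the "right answer for the wrong reasons" case the benchmark is designed to expose.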
Key Findings
When testing seven different large language models, the researchers found that even the most capable models struggled, reaching a decision F1 score of only 0.31. A significant performance gap emerged between diagnostic accuracy and management actions: models were much better at identifying discharge diagnoses (0.51 F1) than at determining the correct management steps (0.17 F1). Furthermore, the study revealed that as cases progressed, models often continued to issue redundant queries, suggesting that current AI models struggle to maintain efficient, long-term clinical reasoning. These results indicate that existing benchmarks may be overestimating how ready AI models are for real-world clinical environments.
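For readers unfamiliar with F1 on sets of decisions, the sketch below computes it with exact string matching. This is a simplification: the summary states that ClinEnv scores decisions using medical ontologies, so its matching is presumably more lenient than exact equality. The function name `set_f1` and the example items are illustrative assumptions.

```python
from typing import Set


def set_f1(predicted: Set[str], actual: Set[str]) -> float:
    """F1 between a predicted set and a reference set, using exact matches."""
    if not predicted and not actual:
        return 1.0  # nothing to predict and nothing predicted
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting three medications of which one appears in a two-item reference list gives precision 1/3, recall 1/2, and an F1 of 0.4.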