AI Research

Paper Abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
Clinical decision support systems are increasingly used in healthcare, but the models powering them are often "black boxes." While many models are labeled as "open," they typically only release the final model weights while keeping the training data, curation methods, and development pipelines secret. This lack of transparency makes it difficult for clinicians and researchers to audit how a model learns or whether it is truly safe for patient care. This paper introduces the first "Fully Open" (FO) pipeline for medical AI, providing a complete, end-to-end framework that allows for full auditability and reproducible validation.

A Transparent Medical Pipeline

The researchers built a clinician-audited training corpus by unifying eight public medical QA datasets into a normalized conversational format and extending them with three synthetic data sources: exam-style QA, clinical vignettes, and guideline-grounded QA derived from 46,469 clinical practice guidelines. A panel of four physicians vetted the generation prompts and audited the synthetic outputs. The team also applied system-wide decontamination so that benchmark items do not leak into the training data, a common problem that lets models memorize evaluation answers and inflates reported scores.
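To make the decontamination step concrete, here is a minimal sketch of one common approach: dropping any training sample that shares a long word n-gram with a benchmark item. This summary does not describe the authors' actual procedure, so the n-gram method, the window size of 8, and the function names `ngrams` and `decontaminate` are all assumptions for illustration.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercase the text, keep alphanumeric tokens, and return its word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(
    train_samples: list[str], benchmark_items: list[str], n: int = 8
) -> list[str]:
    """Keep only training samples that share no n-gram with any benchmark item.

    Hypothetical filter: the overlap criterion and n=8 are assumptions,
    not the method used in the paper.
    """
    benchmark_grams: set[tuple[str, ...]] = set()
    for item in benchmark_items:
        benchmark_grams |= ngrams(item, n)
    return [s for s in train_samples if not (ngrams(s, n) & benchmark_grams)]


if __name__ == "__main__":
    train = [
        "A 54-year-old man presents with crushing chest pain radiating to the left arm.",
        "What is the first-line treatment for uncomplicated hypertension in adults?",
    ]
    benchmark = [
        "Question: a 54-year-old man presents with crushing chest pain. What is the next step?",
    ]
    for kept in decontaminate(train, benchmark):
        print(kept)
```

A stricter or looser filter follows from changing `n`: smaller windows remove more near-duplicates but also discard more legitimate training text.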

Evaluating Clinical Reasoning

Standard medical benchmarks often rely on multiple-choice questions, which reward rote memorization rather than clinical judgment. To address this, the authors developed "Auto-MOOVE," an automated evaluation protocol that uses an LLM-as-a-judge to assess open-ended clinical reasoning over expert-written clinical vignettes. The judge was calibrated against 204 human raters so that its verdicts track expert clinical standards. The framework evaluates models on dimensions such as communication, contextual awareness, and alignment with medical guidelines, giving a more realistic picture of how a model might perform in a clinical setting.
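One way to picture what calibrating a judge against human raters involves is to measure how often the judge's pairwise verdicts match the human verdicts on the same items. The sketch below computes raw agreement and Cohen's kappa. The calibration procedure used for Auto-MOOVE is not described in this summary, so the choice of statistic, the "A"/"B" verdict labels, and the function name `agreement_and_kappa` are assumptions; with many raters, the human label per item could for instance be a majority vote.

```python
from collections import Counter


def agreement_and_kappa(judge: list[str], human: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two aligned lists of verdicts.

    Hypothetical calibration check: each position is one comparison, and each
    label names the preferred response (for example "A" or "B").
    """
    if not judge or len(judge) != len(human):
        raise ValueError("verdict lists must be non-empty and the same length")

    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    judge_freq = Counter(judge)
    human_freq = Counter(human)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)

    if expected == 1.0:
        # Degenerate case: both raters always give the same single label.
        return observed, 1.0
    return observed, (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    judge_verdicts = ["A", "A", "B", "B", "A", "B", "A", "B"]
    human_verdicts = ["A", "A", "B", "A", "A", "B", "B", "B"]
    agreement, kappa = agreement_and_kappa(judge_verdicts, human_verdicts)
    print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

Kappa is reported alongside raw agreement because raw agreement alone can look high when one verdict dominates.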

Performance and Impact

The researchers applied their pipeline to five fully open base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their base models. Apertus-70B-MeditronFO gains 6.6 points over its base on aggregate medical benchmarks (47.2% to 53.8%), a new state of the art among fully open medical models. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and scores higher on HealthBench (58% vs 55.9%). These findings indicate that a fully open pipeline can reach strong domain-specific performance without giving up auditability or reproducibility.
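The reported improvement for Apertus-70B-MeditronFO is an absolute gain in percentage points, which is easy to confuse with a relative improvement. The short sketch below works through both from the two scores given in the abstract (47.2% and 53.8%); the helper name `gains` is hypothetical.

```python
def gains(base_pct: float, tuned_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = tuned_pct - base_pct
    relative = 100.0 * absolute / base_pct
    return absolute, relative


if __name__ == "__main__":
    # Aggregate medical benchmark scores for the base model and Apertus-70B-MeditronFO.
    absolute, relative = gains(47.2, 53.8)
    print(f"{absolute:.1f} points absolute, {relative:.1f}% relative")
    # prints "6.6 points absolute, 14.0% relative"
```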
