Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
Clinical decision support systems are increasingly used in healthcare, but the models powering them are often "black boxes." Many models labeled "open" release only the final weights while keeping the training data, curation methods, and development pipelines private. This lack of transparency makes it difficult for clinicians and researchers to audit how a model was trained or whether it is safe for patient care. This paper introduces what the authors describe as the first "Fully Open" (FO) pipeline for medical AI: a complete, end-to-end framework that supports full auditability and reproducible validation.
A Transparent Medical Pipeline
The researchers built a clinician-audited training corpus by unifying eight public medical datasets and expanding them with three new synthetic data sources: exam-style questions, clinical vignettes, and data grounded in over 46,000 clinical practice guidelines. For quality control, a panel of four physicians vetted the generation prompts and audited the synthetic outputs. The team also applied a system-wide "decontamination" process so that the models are not simply memorizing answers from the evaluation benchmarks, a common issue that can inflate performance scores.
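The summary does not say how decontamination was implemented. As a minimal sketch, assuming a simple word-level n-gram overlap filter (a common approach, not necessarily the authors' method), training samples that share any long n-gram with a benchmark item could be dropped like this. The function names and the default `n=8` are illustrative assumptions.

```python
import re


def ngrams(text, n=8):
    """Return the set of word-level n-grams in text, lowercased and stripped of punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_samples, benchmark_items, n=8):
    """Split train_samples into (kept, dropped) by n-gram overlap with benchmark_items.

    Hypothetical helper: a sample is dropped if it shares at least one
    n-gram with any benchmark item.
    """
    benchmark_grams = set()
    for item in benchmark_items:
        benchmark_grams |= ngrams(item, n)

    kept, dropped = [], []
    for sample in train_samples:
        if ngrams(sample, n) & benchmark_grams:
            dropped.append(sample)
        else:
            kept.append(sample)
    return kept, dropped


# Illustrative data only, not taken from the paper.
benchmark = ["A 54-year-old man presents with crushing chest pain radiating to the left arm."]
train = [
    "Question: a 54-year-old man presents with crushing chest pain radiating to the left arm. What next?",
    "Metformin is a first-line therapy for type 2 diabetes in adults without contraindications.",
]
kept, dropped = decontaminate(train, benchmark)
```

Exact n-gram matching misses paraphrased leaks, so a real pipeline would likely pair it with fuzzy or embedding-based matching.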
Evaluating Clinical Reasoning
Standard medical benchmarks often rely on multiple-choice questions, which reward rote memorization rather than clinical judgment. To address this, the authors developed "Auto-MOOVE," an automated evaluation protocol that uses an LLM-as-a-judge to assess open-ended clinical reasoning. The judge was calibrated against 204 human raters so that its scores align with expert clinical standards. The framework evaluates models on dimensions such as communication, contextual awareness, and alignment with medical guidelines, giving a more realistic picture of how a model might perform in a real-world clinical setting.
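The summary does not state which statistic was used to calibrate the judge against human raters. One common way to check that an automated judge agrees with humans beyond chance is Cohen's kappa; the sketch below assumes categorical labels and uses made-up data. The function name `cohens_kappa` is a hypothetical helper, not part of Auto-MOOVE.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(human_labels)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Expected agreement if the two raters labeled independently
    # with their own marginal label frequencies.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: 1 = response acceptable, 0 = not acceptable.
human = [1, 1, 0, 0]
judge = [1, 1, 0, 1]
kappa = cohens_kappa(human, judge)
```

A kappa near 1 indicates the judge tracks human ratings closely, while a value near 0 means agreement is no better than chance.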
Performance and Impact
The researchers applied their pipeline to five different base models. The MeditronFO variants consistently outperform their base models across medical benchmarks. In particular, the Apertus-70B-MeditronFO model established a new state of the art for fully open medical systems, and the Gemma-3-27B-MeditronFO model was preferred over the existing MedGemma model in over 58% of head-to-head clinical evaluations. These findings suggest that strong medical performance is achievable with full transparency, and that openness does not have to come at the cost of clinical capability.
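A head-to-head preference rate is easier to interpret alongside an uncertainty estimate. The sketch below, assuming each comparison yields a clear winner, computes the preference rate with a Wilson score interval. The counts are illustrative only; the summary reports the percentage but not the number of comparisons.

```python
import math


def preference_rate(wins, losses, z=1.96):
    """Return (rate, lower, upper) using a Wilson score interval.

    z=1.96 corresponds to an approximate 95% interval.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one comparison")

    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half_width, center + half_width


# Hypothetical counts, not from the paper: 58 wins out of 100 comparisons.
rate, lower, upper = preference_rate(58, 42)
```

With only 100 hypothetical comparisons the lower bound falls just below 0.5, which shows why the number of head-to-head evaluations matters when reading a figure like 58%.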