AI News

Harvard Study: AI Models Outperform Doctors in ER Diagnoses

May 4, 2026 • AI Models • AI Research • AI Industry

A Harvard Medical School study found OpenAI's o1 model outperformed internal medicine physicians in emergency room triage, sparking debate on AI in healthcare.

Key Takeaways

  • Highlights the potential for LLMs to assist in high-stakes clinical triage, where rapid decision-making is critical.
  • Underscores the ongoing debate regarding AI accountability and the necessity of comparing AI performance against appropriate medical specialists.
  • Emphasizes that while AI shows promise, it currently lacks the clinical validation and accountability framework required for autonomous life-or-death decision-making.

A new study published in Science suggests that large language models may play a significant role in clinical settings, with researchers finding that an AI model outperformed human physicians in diagnosing patients in an emergency room environment. The study, led by a team of physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center, compared the performance of OpenAI’s o1 and 4o models against human doctors using real-world patient data.

Evaluating AI Diagnostic Accuracy

In an experiment involving 76 patients at the Beth Israel emergency room, researchers compared diagnoses provided by two internal medicine attending physicians against those generated by OpenAI’s models. To keep the assessment unbiased, two additional attending physicians evaluated each diagnosis blinded to whether it had been produced by a human or an AI.

The results indicated that the o1 model performed on par with or better than the human physicians and the 4o model. The performance gap was most notable during initial emergency room triage, where information is limited and the need for rapid decision-making is high. In these triage cases, the o1 model provided an exact or very close diagnosis 67% of the time, compared to 55% and 50% for the two human physicians, respectively. Arjun Manrai, a lead author and head of an AI lab at Harvard Medical School, noted that the model eclipsed both prior models and the physician baselines across the benchmarks tested.

Limitations and Clinical Context

Despite the promising results, the researchers cautioned that the study does not suggest AI is prepared to make life-or-death decisions in clinical practice. Instead, the findings highlight an urgent need for prospective trials to evaluate these technologies in real-world settings. Furthermore, the study was limited to text-based information, and the authors noted that current foundation models have limitations when reasoning over non-text inputs.

The study has also drawn critical commentary regarding its methodology. Emergency physician Kristen Panthagani noted that the research compared AI to internal medicine physicians rather than to emergency room specialists. Panthagani argued that an ER doctor's primary goal is to identify life-threatening conditions rather than to pin down an ultimate diagnosis, and that comparing AI against the wrong specialty risks producing overhyped conclusions.

The Future of Accountability

Beyond the diagnostic performance, the study raises questions about the integration of AI into healthcare. Beth Israel doctor and co-author Adam Rodman pointed out that there is currently no formal framework for accountability regarding AI-generated diagnoses.

Rodman emphasized that patients continue to prioritize human guidance for challenging treatment decisions and life-or-death scenarios. As the technology evolves, the medical community remains focused on how to balance the potential of AI tools with the necessity of human oversight and clear professional responsibility.