OpenAI has introduced LifeSciBench, a new benchmark designed to evaluate artificial intelligence models on complex, real-world life-science research tasks. Unlike traditional biology benchmarks that rely on narrow, fact-based questions with simple answers, LifeSciBench requires models to navigate the nuanced decision-making processes typical of scientific work. The benchmark consists of 750 expert-authored tasks that span seven biological domains—ranging from genomics and medicinal chemistry to clinical and translational science—and seven distinct workflows, including evidence handling, design, and scientific communication.
A Rigorous Standard for Scientific Reasoning
The benchmark was developed through a collaborative effort involving 173 expert scientists, all of whom hold Ph.D.s and possess experience in the biotechnology or pharmaceutical industries. Each task is designed to mirror a briefing between colleagues and requires free-response answers rather than multiple-choice selections. Approximately 79% of these tasks involve multiple reasoning or decision-making steps, with an average of four steps per task. To ensure high quality, each task underwent multiple automated review cycles and at least two expert reviews, with a separate cohort of 453 reviewers validating the content.
The grading system is built upon a comprehensive rubric containing 19,020 individual criteria. Rather than relying on a single reference string, the benchmark evaluates responses based on concrete properties, such as specific facts, reasoning steps, or numeric answers within a defined tolerance. Performance is reported with two primary metrics: a normalized rubric score, which captures partial credit, and a strict task pass rate, under which a task counts as passed only when its rubric score is at least 70%.
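The relationship between the two metrics can be illustrated with a minimal Python sketch. This is not OpenAI's grader: the `Criterion` class, the equal default weights, the relative-tolerance check, and all function names are assumptions made for illustration. Only the 70% pass threshold and the idea of per-criterion partial credit come from the description above.

```python
import math
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # from the article: a task passes at a rubric score of at least 70%


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not the benchmark's schema)."""
    description: str
    met: bool
    weight: float = 1.0  # assumption: equal weights; the article does not describe weighting


def numeric_criterion_met(answer: float, expected: float, rel_tol: float) -> bool:
    """Check a numeric answer against an expected value within a relative tolerance.

    Assumption: the tolerance is relative; the article only says "a defined tolerance".
    """
    return math.isclose(answer, expected, rel_tol=rel_tol)


def rubric_score(criteria: list[Criterion]) -> float:
    """Normalized rubric score: weight earned divided by total weight, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        raise ValueError("rubric has no weighted criteria")
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


def task_passes(criteria: list[Criterion]) -> bool:
    """Strict pass: the normalized score must reach the threshold."""
    return rubric_score(criteria) >= PASS_THRESHOLD


def pass_rate(tasks: list[list[Criterion]]) -> float:
    """Fraction of tasks whose rubric score reaches the pass threshold."""
    return sum(task_passes(t) for t in tasks) / len(tasks)


# Two made-up tasks: one clears the threshold, one earns partial credit only.
task_a = [
    Criterion("names the correct target", True),
    Criterion("justifies the assay choice", True),
    Criterion("reports dose within tolerance", numeric_criterion_met(4.9, 5.0, rel_tol=0.05)),
    Criterion("flags the confounder", False),
]
task_b = [
    Criterion("identifies the variant", True),
    Criterion("proposes a control", True),
    Criterion("interprets the figure", False),
]

print(rubric_score(task_a), task_passes(task_a))
print(rubric_score(task_b), task_passes(task_b))
print(pass_rate([task_a, task_b]))
```

In this made-up example the second task meets two of its three criteria yet still fails, which is how a model can accumulate substantial partial credit while posting a much lower pass rate.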
Model Performance and Limitations
OpenAI evaluated five models in a single-turn setting that permitted unrestricted internet browsing. The results indicate that the benchmark is far from saturated: even the strongest model, GPT-Rosalind, achieved a pass rate of only 36.1%. The other four models scored lower, with GPT-5.5 at 25.7%, Gemini 3.1 Pro at 23.6%, GPT-5.4 at 20.7%, and Grok 4.3 at 13.0%.
The evaluation revealed specific bottlenecks in AI performance. Models struggled significantly when tasks required the use of artifacts—such as sequences, figures, tables, PDFs, and chemical structures—which are included in over half of the benchmark's tasks. For instance, GPT-Rosalind’s performance dropped from 45.1% on text-only tasks to 28.1% when artifacts were involved. Furthermore, while models demonstrated strength in structured judgment, they often stalled mid-task, with many responses earning partial credit but failing to meet the 70% threshold required for a passing grade.
Despite these challenges, the benchmark provides a detailed framework for assessing how AI models handle the complexities of modern scientific research. The developers have provided an interactive rubric grader to demonstrate how these criteria function, highlighting the gap between partial understanding and the rigorous requirements of professional scientific output.
