ABC-Bench: An Agentic Bio-Capabilities Benchmark fo... | AI Research

Paper Abstract

Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity introduces a new evaluation framework designed to measure how effectively AI agents can perform complex biological tasks. As large language models (LLMs) gain the ability to use software tools and laboratory equipment, they are increasingly capable of conducting research that once required highly trained human biologists. This benchmark assesses these "agentic" capabilities across both benign and dual-use tasks, helping researchers understand the potential biosecurity risks associated with advanced AI systems.

Evaluating Agentic Biology

Unlike traditional benchmarks that rely on multiple-choice questions, ABC-Bench tests AI agents in an "agentic scaffold." This means the models are given access to real-world tools—such as Python, bioinformatics software, and web search—to complete multi-step, end-to-end tasks. The benchmark focuses on three specific areas: designing DNA fragments for assembly, finding ways to evade nucleic acid synthesis screening, and writing code to operate liquid handling robots. By using these tasks, the researchers aim to evaluate how well AI can navigate the practical, technical steps involved in molecular biology.
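As a rough illustration of what an "agentic scaffold" means in practice, the sketch below shows a generic tool-dispatch loop. Everything in it is hypothetical: the tool names (`python_eval`, `web_search`), the `scripted_policy` stand-in for a model, and the step limit are assumptions made for illustration, not details of the ABC-Bench harness.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools; a real scaffold would wrap a sandboxed interpreter,
# bioinformatics software, and a search API instead of these stubs.
def python_eval(expr: str) -> str:
    # Evaluates a simple arithmetic expression with builtins disabled.
    return str(eval(expr, {"__builtins__": {}}, {}))

def web_search(query: str) -> str:
    return f"(stub) no results fetched for: {query}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "python_eval": python_eval,
    "web_search": web_search,
}

Action = Tuple[str, str]  # (tool name or "final", argument)

def scripted_policy(history: List[str]) -> Action:
    """Stand-in for an LLM: chooses the next action from past tool outputs."""
    if not history:
        return ("python_eval", "3 * 50")
    return ("final", f"total volume is {history[-1]} uL")

def run_agent(policy: Callable[[List[str]], Action], max_steps: int = 5) -> str:
    """Call tools until the policy returns a final answer or steps run out."""
    history: List[str] = []
    for _ in range(max_steps):
        tool, arg = policy(history)
        if tool == "final":
            return arg
        history.append(TOOLS[tool](arg))
    return "step limit reached"

answer = run_agent(scripted_policy)
```

The point of the loop is that the model's output is treated as an action to execute rather than an answer to grade, which is what separates this style of evaluation from multiple-choice benchmarks.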

Performance Against Human Experts

The study tested eight frontier AI models and compared their performance against a group of human experts, including molecular biologists and those with significant industry experience. The results showed that all tested AI models outperformed the median human expert across all three benchmark tasks. The models were particularly proficient at tasks involving well-documented protocols, such as operating liquid handling robots and designing DNA fragments. However, they performed less effectively on tasks requiring creative, novel bioinformatics reasoning, such as developing new methods to bypass synthesis screening.
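To make the "outperformed the median human expert" comparison concrete, here is a minimal sketch of how such a check could be computed. The scores and agent names are invented placeholders on an assumed 0-1 scale; they are not results from the paper.

```python
from statistics import median

# Placeholder scores on a 0-1 scale, invented for illustration only.
human_scores = [0.20, 0.35, 0.50, 0.65, 0.80]
agent_scores = {"agent_a": 0.70, "agent_b": 0.55}

human_median = median(human_scores)

# An agent "outperforms the median expert" if its score exceeds the median.
beats_median = {name: s > human_median for name, s in agent_scores.items()}
all_beat_median = all(beats_median.values())

# Beating the median is a weaker claim than beating the best human baseliner.
best_human = max(human_scores)
beats_best = {name: s > best_human for name, s in agent_scores.items()}
```

With these placeholder numbers, both agents clear the median but neither exceeds the top human score, which shows why a median-based result should not be read as "better than every expert."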

Real-World Laboratory Validation

To determine whether these digital capabilities translate to physical results, the researchers conducted three wet-lab validation experiments. They tasked OpenAI’s o4-mini-high model with writing code to control an OpenTrons Flex liquid handling robot through a DNA assembly protocol. In all three experiments, the scripts ran on the robot and assembled DNA with the expected sequences. This indicates that, at least for this protocol, AI agents can bridge the gap between digital instructions and physical laboratory execution.

Biosecurity Considerations

The authors emphasize that while these capabilities could accelerate scientific and biomedical discovery, they also shift the landscape of biosecurity risks. Because these agents can automate complex, multi-step biological processes, malicious actors could misuse them to create hazardous biological materials. The researchers argue that as AI capabilities continue to advance, robust benchmarks and governance frameworks are needed to monitor these tools, ensure safety, and prepare for potential misuse.
