EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis
The researchers behind EpiBench have introduced a benchmark that tests how well AI agents perform complex epigenomics analysis. While AI agents have shown promise in handling biological data, they often struggle with the nuanced scientific judgment needed to interpret epigenomic assays, such as ATAC-seq or ChIP-seq, accurately. EpiBench provides a standardized, verifiable way to measure whether these agents can move beyond simple file manipulation and make correct, evidence-based analytical decisions.
How the Benchmark Works
EpiBench consists of 106 distinct evaluation tasks derived from real-world epigenomics workflows. Each task presents the agent with a specific "snapshot" of a workflow state, including the necessary files, metadata, and a clear goal. The benchmark covers a range of essential activities, such as quality control, peak calling, and genomic annotation. To ensure the results are objective, the researchers designed the benchmark so that every task has a deterministically gradable answer. This allows the system to verify if an agent successfully recovered the correct empirical result, rather than just producing a plausible-sounding summary.
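To make the idea of a deterministically gradable answer concrete, here is a minimal sketch of what such a check could look like. It is not EpiBench's actual grader: the function names, the relative tolerance, and the example values are all assumptions chosen for illustration.

```python
import math

def grade_numeric(submitted, expected, rel_tol=0.01):
    """Hypothetical grader: pass if the submitted value parses as a number
    and lies within rel_tol (relative tolerance) of the expected value."""
    try:
        value = float(submitted)
    except (TypeError, ValueError):
        # A prose summary with no parseable number cannot pass.
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)

def grade_label(submitted, expected):
    """Hypothetical grader: pass on a case-insensitive exact label match."""
    return str(submitted).strip().lower() == expected.strip().lower()

# Illustrative values only; they are not taken from the benchmark.
print(grade_numeric("0.42", 0.42))
print(grade_numeric("looks plausible", 0.42))
print(grade_label(" Promoter ", "promoter"))
```

The point of a check like this is that the same submission always gets the same verdict, so a fluent but unsupported summary cannot pass on style alone.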
Key Findings
The study evaluated 16 different model-harness pairs across 5,088 total attempts. The results reveal a significant performance gap: no system was able to pass a majority of the tasks. The top-performing system, GPT-5.5 using the Pi coding harness, achieved a pass rate of 45.0%. Other leading models performed similarly, with pass rates generally falling between 30% and 40%. The researchers noted that while agents are often capable of finding the correct files and performing intermediate calculations, they frequently fail when the task requires deeper, assay-specific scientific reasoning.
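The attempt count can be cross-checked against the benchmark size with a little arithmetic. The sketch below assumes attempts were spread evenly across model-harness pairs and tasks, which this summary does not state explicitly; the variable names are illustrative.

```python
PAIRS = 16             # model-harness pairs evaluated
TASKS = 106            # tasks in the benchmark
TOTAL_ATTEMPTS = 5088  # total attempts reported

# Under the even-split assumption, each pair made the same number of attempts.
attempts_per_pair = TOTAL_ATTEMPTS // PAIRS
attempts_per_task = attempts_per_pair // TASKS

# Confirm the division is exact, so the even split is at least consistent.
assert PAIRS * TASKS * attempts_per_task == TOTAL_ATTEMPTS

print(attempts_per_pair, attempts_per_task)
```

Under that assumption, the totals are consistent with each pair attempting every task three times.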
Why Agents Struggle
A detailed review of the failures suggests that AI agents often struggle to distinguish between generic bioinformatics conventions and the specific requirements of the data at hand. In many instances, agents successfully performed the technical steps but ultimately submitted an incorrect answer because they relied on a "literature prior" or a standard workflow default that was not supported by the specific evidence in the provided files. In some cases, the correct answer was actually present in the agent’s own output, but the model replaced it with a less accurate, more familiar default.
Important Considerations
While EpiBench provides a rigorous test for AI in epigenomics, the authors note that it is not an exhaustive measure of scientific reasoning. The benchmark is currently weighted toward specific task types, such as downstream analysis and quality control, and the distribution of assay types is not perfectly balanced. Furthermore, because the benchmark uses deterministic grading to ensure accuracy, it does not account for every possible scientifically valid path an expert human might take. As such, the researchers view EpiBench as a starting point for improving how AI agents ground their biological claims in empirical evidence.