TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
TxBench-PP is a new benchmark that tests how well AI agents handle the complex, data-driven decisions required in small-molecule drug discovery. AI is often touted as a way to speed up pharmaceutical research, but its practical value depends on whether it can accurately interpret real-world experimental data. The benchmark therefore evaluates whether agents reach correct scientific conclusions from the assay data they are given, rather than relying on facts memorized from the scientific literature.

Evaluating Real-World Scientific Decisions

TxBench-PP consists of 100 distinct evaluations that mirror the actual decision-making points scientists face in a laboratory setting. These tasks cover a wide range of preclinical pharmacology, including mechanism-of-action reasoning, target engagement, safety assessments, and translational efficacy.
To keep the benchmark rigorous, agents receive realistic workflow snapshots and files in a coding environment and must return structured answers that are graded deterministically. The tasks are specifically designed to penalize systems that answer from pre-trained knowledge instead of the provided data, so an agent has to perform the actual data analysis to arrive at the correct answer.
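As a rough illustration of what deterministic grading of a structured answer can look like, here is a minimal Python sketch. It is not the benchmark's actual grader: the `grade` function, the field names `ic50_nM` and `advance`, and the 5% numeric tolerance are all assumptions made for the example.

```python
import math


def grade(answer, expected, rel_tol=0.05):
    """Return True only if every expected field matches the agent's answer.

    Hypothetical grader: strings and booleans must match exactly, numbers
    must fall within a relative tolerance. The same input always yields the
    same verdict, which is what makes the grading deterministic.
    """
    for key, want in expected.items():
        if key not in answer:
            return False
        got = answer[key]
        if isinstance(want, (bool, str)):
            # Exact match, and no silent coercion such as 0 == False.
            if type(got) is not type(want) or got != want:
                return False
        elif isinstance(want, (int, float)):
            if isinstance(got, bool) or not isinstance(got, (int, float)):
                return False
            if not math.isclose(got, want, rel_tol=rel_tol):
                return False
        else:
            raise TypeError(f"unsupported expected type for {key!r}")
    return True


# Hypothetical task: report a potency estimate and an advancement call.
expected = {"ic50_nM": 12.0, "advance": False}
from_data = {"ic50_nM": 12.1, "advance": False}   # derived from the assay files
from_memory = {"ic50_nM": 12.1, "advance": True}  # contradicts the data
```

Under these assumptions, `grade(from_data, expected)` passes and `grade(from_memory, expected)` fails, because a single wrong field fails the whole answer.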

Performance of Current AI Systems

The study tested 16 model-harness configurations across 4,800 individual agent trajectories. The results indicate that current AI systems are not yet reliable enough for autonomous use in preclinical pharmacology. Even the top-performing configuration, Claude Opus 4.8 paired with the Pi harness, passed only 59.3% of its attempts.
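The reported totals can be checked with some quick arithmetic. The sketch below assumes the 4,800 trajectories are split evenly across the 16 configurations and the 100 evaluations, which the summary does not state explicitly. The implied pass count is an inference from the rounded 59.3% figure, not a reported number.

```python
def per_config_budget(total_trajectories, n_configs, n_tasks):
    """Split a trajectory total evenly across configurations and tasks.

    Raises ValueError if the totals do not divide evenly, since the even
    split is an assumption rather than something the source confirms.
    """
    per_config, remainder = divmod(total_trajectories, n_configs)
    if remainder:
        raise ValueError("trajectories do not divide evenly across configs")
    attempts_per_task, remainder = divmod(per_config, n_tasks)
    if remainder:
        raise ValueError("trajectories do not divide evenly across tasks")
    return per_config, attempts_per_task


per_config, attempts_per_task = per_config_budget(4800, 16, 100)

# Pass count implied by the top configuration's reported 59.3% pass rate.
implied_passes = round(0.593 * per_config)
implied_rate_pct = round(100 * implied_passes / per_config, 1)

print(per_config, attempts_per_task, implied_passes, implied_rate_pct)
```

If the even split holds, each configuration ran 300 trajectories, or three attempts per evaluation, and the best configuration's 59.3% corresponds to roughly 178 passing attempts out of 300.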
The researchers found that while models often engage with the data and perform plausible analyses, they frequently stumble due to specific scientific errors. These include failing to apply necessary quality control, making incorrect statistical choices, or misinterpreting biological context. In some cases, models even discarded data-supported findings in favor of incorrect, memorized textbook facts.

The Role of Implementation and Complexity

A key finding of the research is that performance depends not only on the AI model itself but also on the "harness", the system that manages the agent's tools and environment. When the same model was run under different harnesses, the researchers observed significant differences in pass rates, suggesting that how an agent is implemented can matter as much as the underlying model.
Furthermore, the study highlighted that overall accuracy scores do not necessarily predict how well an agent will perform on high-stakes "go/no-go" advancement decisions. Even the most accurate models struggled with tasks that required selecting the correct set of candidates from a group, often advancing unsafe compounds or incorrectly discarding promising ones.
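One way to see why overall accuracy can mislead on set-selection tasks is to score the same answer two ways. The sketch below is an illustration only: the `score_selection` function and the compound IDs are hypothetical and are not taken from the benchmark.

```python
def score_selection(advanced, correct, all_candidates):
    """Score an advancement decision both per compound and as a whole set."""
    advanced, correct = set(advanced), set(correct)
    all_candidates = set(all_candidates)
    wrongly_advanced = advanced - correct  # e.g. unsafe compounds moved forward
    wrongly_dropped = correct - advanced   # promising compounds discarded
    n_errors = len(wrongly_advanced) + len(wrongly_dropped)
    return {
        "exact_match": n_errors == 0,
        "per_compound_accuracy": (len(all_candidates) - n_errors) / len(all_candidates),
        "wrongly_advanced": sorted(wrongly_advanced),
        "wrongly_dropped": sorted(wrongly_dropped),
    }


# Hypothetical panel of ten candidates, three of which should advance.
candidates = [f"C{i}" for i in range(1, 11)]
result = score_selection(
    advanced={"C2", "C5", "C9"},
    correct={"C2", "C5", "C7"},
    all_candidates=candidates,
)
```

In this made-up case the agent labels 8 of 10 compounds correctly, yet the decision still fails as a set: it advances `C9`, which should have been dropped, and discards `C7`, which should have advanced.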

Future Directions

TxBench-PP is intended as a starting point for a broader effort called TherapeuticsBench. The authors emphasize that this release is limited to small-molecule preclinical pharmacology and should not be used to judge an agent's capability in other areas, such as clinical trials or different therapeutic modalities like biologics or gene therapies. Future work aims to expand the benchmark to cover these additional stages and modalities, providing a more comprehensive roadmap for evaluating AI in the drug discovery ecosystem.
