Key Takeaways

  • Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions.
  • TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature.
  • Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically.
  • Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions.
Paper Abstract

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
The researchers behind TxBench-PP have created a new benchmark designed to test how well AI agents can handle the complex, data-driven decisions required in small-molecule drug discovery. While AI is often touted as a tool to speed up pharmaceutical research, its practical use depends on its ability to accurately interpret real-world experimental data. This benchmark evaluates whether AI agents can reach correct scientific conclusions based on provided assay data, rather than simply relying on memorized facts from scientific literature.

Evaluating Real-World Scientific Decisions

TxBench-PP consists of 100 distinct evaluations that mirror the actual decision-making points scientists face in a laboratory setting. These tasks cover a wide range of preclinical pharmacology, including mechanism-of-action reasoning, target engagement, safety assessments, and translational efficacy.
To ensure the benchmark remains rigorous, agents are provided with realistic workflow snapshots and files in a coding environment. They must then return structured answers that are graded deterministically. The tasks are specifically designed to penalize systems that attempt to "cheat" by using pre-trained knowledge, forcing the agents to perform actual data analysis to arrive at the correct answer.
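To make "structured answers graded deterministically" concrete, here is a minimal Python sketch of what such a grader could look like. It is not the benchmark's actual grader: the `Endpoint` schema, the field names, the compound ID, and the tolerance rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Endpoint:
    """One gradable field in a structured answer (hypothetical schema)."""
    key: str
    expected: Any
    tolerance: float = 0.0  # absolute tolerance, used for numeric endpoints only


def normalize(value: Any) -> Any:
    """Canonicalize strings so whitespace and case do not affect grading."""
    if isinstance(value, str):
        return value.strip().lower()
    return value


def grade_endpoint(endpoint: Endpoint, answer: dict) -> bool:
    """Return True if the submitted value matches the expected value."""
    if endpoint.key not in answer:
        return False  # a missing field always fails
    submitted = normalize(answer[endpoint.key])
    expected = normalize(endpoint.expected)
    # Booleans must match in type and value (bool is a subclass of int).
    if isinstance(expected, bool) or isinstance(submitted, bool):
        return type(submitted) is type(expected) and submitted == expected
    # Numeric endpoints pass when within the stated absolute tolerance.
    if isinstance(expected, (int, float)) and isinstance(submitted, (int, float)):
        return abs(submitted - expected) <= endpoint.tolerance
    return submitted == expected


# Hypothetical task with one categorical and one numeric endpoint.
endpoints = [
    Endpoint("most_potent_compound", "CMPD-007"),
    Endpoint("ic50_nM", 12.5, tolerance=0.5),
]
answer = {"most_potent_compound": " cmpd-007 ", "ic50_nM": 12.8}
results = {e.key: grade_endpoint(e, answer) for e in endpoints}
print(results)
```

Because the grader is a pure function of the expected values and the submitted answer, the same answer always receives the same score, with no model-based or human judgment in the loop.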

Performance of Current AI Systems

The study tested 16 model-harness configurations built from 11 models, totaling 4,800 individual agent trajectories. The results indicate that current AI systems are not yet reliable enough for autonomous use in preclinical pharmacology. Even the top-performing configuration, Claude Opus 4.8 paired with the Pi harness, passed only 59.3% of endpoint attempts (178/300), followed by GPT-5.5 with the same harness at 55.3% (166/300).
The researchers found that while models often engage with the data and perform plausible analyses, they frequently stumble due to specific scientific errors. These include failing to apply necessary quality control, making incorrect statistical choices, or misinterpreting biological context. In some cases, models even discarded data-supported findings in favor of incorrect, memorized textbook facts.
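The headline figures can be checked with a few lines of Python. The sketch below recomputes the pass rates from the reported counts and derives the per-configuration attempt count from the reported totals. Two parts are assumptions rather than facts from the paper: that every evaluation was attempted equally often (which would imply three attempts per evaluation), and the use of a plain Wilson score interval, which treats all 300 attempts as independent. That naive interval comes out narrower than the reported 51.1-67.6; one plausible explanation is that the authors account for repeated attempts on the same evaluation, but the method is not stated here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumes independent trials)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Totals reported in the abstract.
TOTAL_TRAJECTORIES = 4800
CONFIGURATIONS = 16
EVALUATIONS = 100

attempts_per_config = TOTAL_TRAJECTORIES // CONFIGURATIONS  # 300, matches the reported denominators
# Inference, not stated in the source: equal attempts per evaluation.
attempts_per_eval = attempts_per_config // EVALUATIONS  # 3

# Reported (passed, attempted) counts for the two strongest configurations.
reported = {
    "Claude Opus 4.8 / Pi": (178, 300),
    "GPT-5.5 / Pi": (166, 300),
}

for name, (passed, attempts) in reported.items():
    rate = passed / attempts
    lo, hi = wilson_interval(passed, attempts)
    print(f"{name}: {rate:.1%} (naive Wilson 95% CI {lo:.1%} to {hi:.1%})")
```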

The Role of Implementation and Complexity

A key finding of the research is that performance depends not only on the AI model itself but also on the "harness", the system that manages the agent's tools and environment. When the same model was run under different harnesses, the researchers observed significant differences in pass rates, suggesting that how an agent is implemented can matter as much as the underlying model.
Furthermore, the study highlighted that overall accuracy scores do not necessarily predict how well an agent will perform on high-stakes "go/no-go" advancement decisions. Even the most accurate models struggled with tasks that required selecting the correct set of candidates from a group, often advancing unsafe compounds or incorrectly discarding promising ones.
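A small Python sketch illustrates why per-item accuracy and set-level correctness can diverge on a selection task. The compound labels and the scoring functions are hypothetical and are not taken from the benchmark; the point is only that an agent can classify most candidates correctly and still return the wrong set.

```python
def per_compound_accuracy(selected: set[str], correct: set[str], candidates: set[str]) -> float:
    """Fraction of candidates whose advance/discard call matches the reference."""
    agree = sum((c in selected) == (c in correct) for c in candidates)
    return agree / len(candidates)


def exact_set_match(selected: set[str], correct: set[str]) -> bool:
    """A go/no-go decision is right only if the whole selected set is right."""
    return selected == correct


def selection_errors(selected: set[str], correct: set[str]) -> dict[str, set[str]]:
    """Split mistakes into the two failure modes described in the text."""
    return {
        "wrongly_advanced": selected - correct,
        "wrongly_discarded": correct - selected,
    }


# Hypothetical panel of ten candidates; three should advance.
candidates = set("ABCDEFGHIJ")
correct = {"A", "B", "C"}
selected = {"A", "B", "D"}  # one compound swapped

accuracy = per_compound_accuracy(selected, correct, candidates)
matched = exact_set_match(selected, correct)
errors = selection_errors(selected, correct)
print(accuracy, matched, errors)
```

In this toy case the agent agrees with the reference on eight of ten compounds, yet the decision fails: compound D is advanced when it should not be, and compound C is discarded when it should advance.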

Future Directions

TxBench-PP is intended as a starting point for a broader effort called TherapeuticsBench. The authors emphasize that this release is limited to small-molecule preclinical pharmacology and should not be used to judge an agent's capability in other areas, such as clinical trials or different therapeutic modalities like biologics or gene therapies. Future work aims to expand the benchmark to cover these additional stages and modalities, providing a more comprehensive roadmap for evaluating AI in the drug discovery ecosystem.
