Curated AI beats frontier LLMs at pharma asset discovery

Key Takeaways

  • General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines.
  • All five systems receive the same natural-language query and the same JSON output schema.
  • Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs.
  • This paper investigates how to improve the accuracy and comprehensiveness of AI tools used to scout pharmaceutical pipelines.
Paper Abstract

General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines. We benchmark Gosset -- an AI platform with a chat interface backed by curated target-, modality-, and indication-level drug-asset annotations -- against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets where most of the pipeline lives in the long tail of preclinical and Asian-developed assets. All five systems receive the same natural-language query and the same JSON output schema. Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs. The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool, suggesting that each of these systems can close most of the recall gap by swapping generic web search for a curated index behind the same chat interface.

Curated AI beats frontier LLMs at pharma asset discovery
This paper investigates how to improve the accuracy and comprehensiveness of AI tools used to scout pharmaceutical pipelines. General-purpose Large Language Models (LLMs) are increasingly used to track drug development, but they often miss "long-tail" assets such as preclinical programs and Asian-developed drugs. The authors benchmark "Gosset," an AI platform that replaces generic web search with a curated index of drug-asset annotations, against four frontier systems with web access: Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, and Perplexity sonar-pro.

The Challenge of the "Long Tail"

When pharma analysts search for drugs targeting specific proteins or conditions, they require high recall (finding every real program) and high precision (avoiding fabricated names). Frontier LLMs perform well for high-profile, late-stage drugs that appear frequently in press releases. However, they often fail to capture the vast majority of the pipeline, which consists of preclinical, academic, and smaller biotech programs. Because these models rely on general web search, they struggle to find information that is sparsely indexed or buried in niche sources, and they are prone to hallucinating when asked to generate exhaustive lists.
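As a rough illustration of the two metrics named above, the sketch below scores one system's answer for a single target. The function name, the case-insensitive name matching, and the placeholder drug names are assumptions made for this example; they are not taken from the paper.

```python
def precision_recall(returned, verified_reference):
    """Score one answer: precision penalizes fabricated names, recall penalizes misses."""
    # Assumption: names are matched case-insensitively after trimming whitespace.
    returned_set = {name.strip().lower() for name in returned}
    reference_set = {name.strip().lower() for name in verified_reference}
    true_hits = returned_set & reference_set
    precision = len(true_hits) / len(returned_set) if returned_set else 0.0
    recall = len(true_hits) / len(reference_set) if reference_set else 0.0
    return precision, recall


# Hypothetical placeholder names, not real assets.
reference = ["DRUG-A", "DRUG-B", "DRUG-C", "DRUG-D"]
answer = ["drug-a", "DRUG-B", "DRUG-X"]  # DRUG-X stands in for a fabricated name

precision, recall = precision_recall(answer, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.50
```

Here the answer contains one fabricated name, which lowers precision, and misses two real programs, which lowers recall.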

How Gosset Works

Gosset is a chat interface that queries a structured, curated index of target-, modality-, and indication-level drug-asset annotations instead of searching the open web. To test its effectiveness, the researchers ran a head-to-head comparison on ten niche oncology and immunology targets. All five systems (Gosset and the four frontier LLMs) received the same natural-language query and had to return results in the same JSON output schema. The returned drugs were then verified through a three-layer pipeline: deterministic auto-passing for known data, an "LLM-as-a-judge" cross-check, and final sign-off by human experts with pharmaceutical backgrounds.
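The three verification layers can be sketched as a small function. The summary does not say how the layers hand off to one another, so the control flow here is an assumption: a deterministic match short-circuits, and everything else goes to an LLM judge whose verdict a human reviewer can confirm or overturn. `Claim`, `verify`, and both stub callables are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    target: str
    drug_name: str


def verify(claim, known_assets, llm_judge, human_review):
    """Return (is_verified, deciding_layer) for one claimed drug."""
    # Layer 1: deterministic auto-pass for (target, name) pairs already in trusted data.
    if (claim.target, claim.drug_name.lower()) in known_assets:
        return True, "auto-pass"
    # Layer 2: an LLM judge gives a provisional verdict (assumed to return a bool).
    judge_verdict = llm_judge(claim)
    # Layer 3: a human expert signs off and may overturn the judge.
    return human_review(claim, judge_verdict), "human sign-off"


# Stubs standing in for a real judge model and a real reviewer (assumptions).
def stub_judge(claim):
    return claim.drug_name.upper().startswith("DRUG-")


def stub_human(claim, judge_verdict):
    return judge_verdict  # this stub simply agrees with the judge


known = {("TARGET-1", "drug-a")}
claims = [
    Claim("TARGET-1", "DRUG-A"),
    Claim("TARGET-1", "DRUG-B"),
    Claim("TARGET-1", "Madeupumab"),
]
results = [verify(c, known, stub_judge, stub_human) for c in claims]
print(results)
```

Passing the judge and reviewer in as callables keeps the deterministic layer testable on its own.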

Key Results

The study found that Gosset surfaced far more drug assets than the frontier models. Across the ten targets, Gosset returned 3.2 times more verified drugs per query than the best-performing frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs. The frontier models were generally accurate (maintaining high precision) but suffered from a major recall gap, missing the majority of the preclinical and early-stage assets that Gosset surfaced. Additionally, because Gosset queries a structured database rather than performing multiple live web searches, it answers in a fraction of the time, which makes for a more interactive experience.
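To make the reference set concrete, here is a minimal sketch of pooling verified drugs across systems for one target, then computing each system's recall and the ratio of the curated system's count to the best frontier count. All names and counts are made up for illustration and do not reproduce the paper's figures. How the paper aggregates across its ten targets is not stated here, so this covers a single target only.

```python
def pooled_reference(verified_by_system):
    """Union of every verified drug that any system returned for one target."""
    reference = set()
    for drugs in verified_by_system.values():
        reference |= drugs
    return reference


# Hypothetical verified results for a single target (placeholder names).
verified_by_system = {
    "curated": {"DRUG-A", "DRUG-B", "DRUG-C", "DRUG-D", "DRUG-E", "DRUG-F"},
    "frontier_1": {"DRUG-A", "DRUG-B"},
    "frontier_2": {"DRUG-A", "DRUG-C", "DRUG-D"},
}

reference = pooled_reference(verified_by_system)

# Recall of each system against the pooled reference set.
recall = {name: len(drugs) / len(reference) for name, drugs in verified_by_system.items()}

# Ratio of the curated system's verified count to the best frontier system's count.
best_frontier = max(
    len(drugs) for name, drugs in verified_by_system.items() if name != "curated"
)
ratio = len(verified_by_system["curated"]) / best_frontier

print(recall["curated"], recall["frontier_2"], ratio)  # 1.0 0.5 2.0
```

A recall of 1.0 here means only that the curated system found everything that any tested system found.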

Limitations and Future Directions

The authors note that their "100% recall" metric is limited to the "discoverable universe" of drugs—those traceable to public sources like patents, conferences, and press releases. Programs that remain purely internal or undisclosed are invisible to all systems tested. Furthermore, the study acknowledges that the results may be biased toward targets where Gosset’s index is particularly well-populated. To address the recall gap in other models, the authors suggest that frontier LLMs can be connected to the Gosset index via the Model Context Protocol (MCP). This would allow these models to retain their natural language reasoning capabilities while offloading the task of asset enumeration to a specialized, curated database.
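The idea of offloading enumeration to a tool can be sketched generically. The code below does not use an MCP SDK or Gosset's actual server. The tool name `list_assets_for_target`, the request shape, and the in-memory index are all hypothetical stand-ins that show only the pattern: the model emits a structured tool call, and the host answers it from a curated index instead of a web search.

```python
import json

# Hypothetical in-memory stand-in for a curated index (placeholder data).
CURATED_INDEX = {
    "TARGET-1": [
        {"drug_name": "DRUG-A", "modality": "antibody", "stage": "preclinical"},
        {"drug_name": "DRUG-B", "modality": "small molecule", "stage": "phase 1"},
    ],
}


def list_assets_for_target(target):
    """Return every indexed asset for a target; an unknown target yields an empty list."""
    return CURATED_INDEX.get(target, [])


# Registry mapping tool names to callables (names are assumptions for this sketch).
TOOLS = {"list_assets_for_target": list_assets_for_target}


def handle_tool_call(request_json):
    """Dispatch one JSON tool call and return a JSON result for the model to read."""
    request = json.loads(request_json)
    tool = TOOLS[request["name"]]
    result = tool(**request["arguments"])
    return json.dumps({"assets": result})


# The model would emit a request like this instead of issuing web searches.
response = handle_tool_call(
    json.dumps({"name": "list_assets_for_target", "arguments": {"target": "TARGET-1"}})
)
assets = json.loads(response)["assets"]
print([asset["drug_name"] for asset in assets])  # ['DRUG-A', 'DRUG-B']
```

The model still writes the final answer in natural language. Only the list of assets comes from the index.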
