BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

Key Takeaways

  • Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature.
  • To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction.
  • For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications.
  • BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets.
Paper Abstract

Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm, in which multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner's practical utility is demonstrated via three applications: (1) extracting 82,262 data points from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the amount of high-quality NLRP3 bioactivity data, contributing to a 38.6% improvement across 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein-ligand complex bioactivity annotation, achieving a 5.59-fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.

BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature
Drug discovery relies heavily on understanding how specific molecules (ligands) interact with proteins. While this information is buried in millions of scientific papers, manual extraction is too slow to keep up with the pace of new research. BioMiner is a new multi-modal AI system designed to automate this process. It extracts complex bioactivity data—such as how strongly a drug candidate binds to a target protein—from the text, tables, and figures of scientific literature, effectively turning static documents into structured, usable data for drug discovery.

A Decoupled Approach to Extraction

The primary challenge in extracting this data is that it requires two very different skills: understanding the biochemical context described in a paper and accurately reconstructing the chemical structure of the ligands. Previous attempts to solve this with a single "one-shot" AI model often failed because the tasks are too complex to bundle together.
BioMiner solves this by separating the process into distinct stages. It uses specialized agents to handle document parsing, bioactivity measurement, and chemical structure resolution. By decoupling these tasks, the system can use specific tools for each: it uses semantic reasoning for bioactivity measurements and a "Chemical-Structure-Grounded Visual Semantic Reasoning" (CSG-VSR) paradigm for chemical structures. This allows the system to handle "Markush structures"—complex, shorthand chemical representations that describe groups of related compounds—which are notoriously difficult for standard AI to resolve.
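To make the decoupling concrete, here is a minimal sketch, not BioMiner's actual implementation. It assumes a hypothetical `Measurement` record and a toy Markush scaffold written as a SMILES string with an `[R1]` placeholder. The bioactivity records only reference compound labels; structures are resolved separately and the two are joined at the end. Plain string substitution stands in for the domain chemistry tools the paper delegates to, and the protein name and activity values are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical output of the semantic stage: no structure, only a label."""
    protein: str
    compound_id: str
    activity_type: str
    value_nm: float


def enumerate_markush(scaffold: str, r_groups: dict[str, dict[str, str]]) -> dict[str, str]:
    """Map each compound label to a SMILES string by filling R-group placeholders.

    This is naive text substitution for illustration only; a real pipeline
    would use a cheminformatics toolkit that understands attachment points.
    """
    structures = {}
    for compound_id, substituents in r_groups.items():
        smiles = scaffold
        for label, fragment in substituents.items():
            smiles = smiles.replace(f"[{label}]", fragment)
        structures[compound_id] = smiles
    return structures


def join_triplets(measurements: list[Measurement], structures: dict[str, str]) -> list[tuple[str, str, str]]:
    """Combine both stages into (protein, ligand SMILES, activity) triplets."""
    triplets = []
    for m in measurements:
        smiles = structures.get(m.compound_id)
        if smiles is None:
            continue  # structure could not be resolved; skip rather than guess
        activity = f"{m.activity_type} = {m.value_nm:g} nM"
        triplets.append((m.protein, smiles, activity))
    return triplets


if __name__ == "__main__":
    # Toy Markush structure: a benzene ring with one variable substituent.
    structures = enumerate_markush(
        "c1ccccc1[R1]",
        {"1a": {"R1": "C"}, "1b": {"R1": "Cl"}},
    )
    # Made-up measurements, as if read from a results table.
    measurements = [
        Measurement("ProteinX", "1a", "IC50", 120.0),
        Measurement("ProteinX", "1b", "IC50", 45.0),
    ]
    for triplet in join_triplets(measurements, structures):
        print(triplet)
```

Keeping the compound label as the only link between the two stages means an error in structure resolution does not corrupt the measurement record, and vice versa.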

The BioVista Benchmark

To measure the system's accuracy and to provide a standard for future research, the authors introduced BioVista. This is a large-scale, expert-curated benchmark containing 16,457 bioactivity entries from 500 scientific publications. Because it includes data from text, tables, and figures, it provides a rigorous testing ground for AI models. The benchmark is designed to prevent "cheating" by keeping a large portion of the data hidden from the model during development, ensuring that the reported performance reflects how the system would handle entirely new, unseen research papers.
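The F1 score of 0.32 quoted in the abstract is computed over bioactivity triplets. As a rough illustration of what such a score measures, the sketch below treats predictions and reference annotations as sets of (protein, ligand, activity) tuples and counts only exact matches as correct. The exact-match rule and the example tuples are assumptions made here; the paper's own matching criteria may differ.

```python
from __future__ import annotations

Triplet = tuple[str, str, str]  # (protein, ligand SMILES, activity), illustrative


def triplet_f1(predicted: list[Triplet], gold: list[Triplet]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) using exact set matching of triplets."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Invented example: four reference triplets, three predictions, two correct.
    gold = [
        ("ProteinX", "c1ccccc1C", "IC50 = 120 nM"),
        ("ProteinX", "c1ccccc1Cl", "IC50 = 45 nM"),
        ("ProteinX", "c1ccccc1OC", "IC50 = 300 nM"),
        ("ProteinX", "c1ccccc1F", "IC50 = 80 nM"),
    ]
    predicted = [
        ("ProteinX", "c1ccccc1C", "IC50 = 120 nM"),
        ("ProteinX", "c1ccccc1Cl", "IC50 = 45 nM"),
        ("ProteinX", "c1ccccc1Br", "IC50 = 80 nM"),  # wrong structure
    ]
    precision, recall, f1 = triplet_f1(predicted, gold)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under exact matching, a triplet with the right protein and value but a wrong ligand structure counts as both a false positive and a missed reference entry, which is one reason triplet-level scores tend to be much lower than scores for any single field.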

Real-World Impact

The researchers demonstrated the practical utility of BioMiner through three key applications:
  • Accelerating Database Building: The system extracted 82,262 data points from 11,683 papers in just three days. When used to pre-train other drug discovery models, this data improved their performance by 3.9%.
  • Human-in-the-Loop Workflow: By having the AI perform the heavy lifting and letting human experts verify the results, the team doubled the available high-quality data for the NLRP3 protein. This contributed to a 38.6% improvement across 28 QSAR models and helped identify 16 hit candidates with novel scaffolds.
  • Annotation Efficiency: In tests using the PoseBusters dataset, BioMiner-assisted workflows were 5.59 times faster than manual annotation while also improving accuracy by 5.75%.
These results suggest that BioMiner can significantly reduce the human effort required to curate scientific data, helping to unlock vast amounts of information that were previously trapped in PDF files.
