ProjectionBench: Evaluating Scientific Hypothesis G... | AI Research

Key Takeaways

  • Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge.
  • We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test.
  • In our framework, models initially receive only the topic and research question from a recent paper, with technical details progressively revealed.
  • Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, crucial for advancing the development of next-generation AI scientist/co-scientist systems.
Paper Abstract

Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, the innovative reasoning abilities essential for true scientific discovery remain largely untested. We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test. In our framework, models initially receive only the topic and research question from a recent paper, with technical details progressively revealed. At each stage of information disclosure, the model is tasked with generating hypotheses that address the research question; these are compared with the conclusions from the original paper and evaluated via automated semantic similarity of constituent atomic claims. This progressive evaluation of semantic divergence from ground-truth conclusions enables assessment ranging from a model's innovativeness (under minimal information) to its grounded reasoning capabilities (under full experimental details), both critical for using LLMs for scientific discovery purposes. Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, crucial for advancing the development of next-generation AI scientist/co-scientist systems. Specifically, here we evaluate GPT-5, GPT-5.4, Gemini 2.5 Pro, and Gemini 3.1 Pro Preview across 45 papers spanning bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 Pro Preview outperform their previous-generation counterparts as expected, and GPT-5.4 in particular maintains an F1 score of 0.7 in alignment with ground-truth conclusions even under minimal context.

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

Scientific discovery is a creative and uncertain process that requires more than just recalling facts; it demands the ability to reason through new problems. While many existing benchmarks test how well AI models can answer textbook questions or retrieve information, they often fail to measure a model's capacity for genuine scientific discovery. This paper introduces ProjectionBench, a new framework designed to evaluate how well large language models (LLMs) can generate scientific hypotheses and predict research outcomes, moving from minimal information to full experimental details.

A Progressive Approach to Discovery

The core of ProjectionBench is a "progressive disclosure" framework. Instead of asking a model to solve a problem all at once, the benchmark reveals information in stages. It begins by providing only the research topic and the research question. In later stages, the model receives additional context, such as the null hypothesis and the specific experimental procedures used in the study. Testing the model at each stage lets researchers assess both its "innovativeness" (its ability to make educated guesses with little information) and its "grounded reasoning" (its ability to draw accurate conclusions when given full experimental details).
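The staged disclosure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the `PaperRecord` fields, the three stage names, and the prompt wording are all assumptions made for the example, and the actual benchmark may use a different number of stages or different field boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperRecord:
    """Hypothetical container for the pieces of a paper that get disclosed."""
    topic: str
    research_question: str
    null_hypothesis: str
    experimental_procedure: str


# Assumed stage layout: each stage reveals everything the previous one did,
# plus one more field. The stage names are made up for this sketch.
STAGES = [
    ("minimal", ["topic", "research_question"]),
    ("with_null_hypothesis", ["topic", "research_question", "null_hypothesis"]),
    ("full_details", ["topic", "research_question", "null_hypothesis",
                      "experimental_procedure"]),
]


def build_prompts(paper: PaperRecord) -> dict:
    """Return one prompt per disclosure stage, keyed by stage name."""
    prompts = {}
    for stage_name, field_names in STAGES:
        # Render only the fields that this stage is allowed to see.
        context = "\n".join(
            f"{name.replace('_', ' ').title()}: {getattr(paper, name)}"
            for name in field_names
        )
        prompts[stage_name] = (
            context
            + "\n\nPropose hypotheses that address the research question."
        )
    return prompts
```

Each prompt would then be sent to the model under test, and the hypotheses returned at each stage scored separately, so that minimal-context and full-context performance can be compared for the same paper.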

Measuring Accuracy Through Atomic Claims

To grade the models, the researchers developed an automated method that breaks down complex scientific results into "atomic claims." These claims represent specific relationships between variables, such as how an experimental manipulation affects a measured outcome. By comparing a model's atomic claims against the ground-truth claims from the published papers, the system computes precision and recall, which are combined into an F1 score. This lets the benchmark identify whether a model is missing key findings (low recall) or including "extraneous" claims that do not match the actual results (low precision). To ensure fairness, the researchers use an LLM-as-a-judge approach, which is calibrated to avoid bias toward any specific model's writing style.
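A minimal sketch of claim-level scoring follows, under stated assumptions. The `AtomicClaim` fields, the greedy one-to-one matching, and the `exact_match` placeholder are all hypothetical; in the paper the match decision comes from automated semantic similarity with an LLM judge, which would be supplied here through the `is_match` argument.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AtomicClaim:
    """Hypothetical claim shape: a manipulation, an outcome, and the effect."""
    manipulation: str
    outcome: str
    effect: str  # e.g. "increases", "decreases", "no effect"


def exact_match(a: AtomicClaim, b: AtomicClaim) -> bool:
    # Placeholder matcher; a real judge would compare meaning, not strings.
    return a == b


def score_claims(
    predicted: Sequence[AtomicClaim],
    reference: Sequence[AtomicClaim],
    is_match: Callable[[AtomicClaim, AtomicClaim], bool] = exact_match,
) -> dict:
    """Greedily match predicted claims to reference claims, one-to-one."""
    matched_refs = set()
    extraneous = []
    for claim in predicted:
        for i, ref in enumerate(reference):
            if i not in matched_refs and is_match(claim, ref):
                matched_refs.add(i)
                break
        else:
            # No unmatched reference claim agreed with this prediction.
            extraneous.append(claim)

    true_pos = len(matched_refs)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    missed = [r for i, r in enumerate(reference) if i not in matched_refs]
    return {"precision": precision, "recall": recall, "f1": f1,
            "extraneous": extraneous, "missed": missed}
```

For example, if a model produces two claims of which one matches, and the paper contains three claims, precision is 1/2, recall is 1/3, and F1 is 0.4. The `extraneous` and `missed` lists make it possible to inspect which side of the score is responsible for a low F1.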

Performance Across Frontier Models

The researchers tested four frontier models (GPT-5, GPT-5.4, Gemini 2.5 Pro, and Gemini 3.1 Pro Preview) on 45 recent papers from three fields: bioactive materials, mechanical materials, and nanomaterials. The newer models, GPT-5.4 and Gemini 3.1 Pro Preview, outperformed their previous-generation counterparts. Notably, GPT-5.4 maintained an F1 score of 0.7 against ground-truth conclusions even when provided with minimal context. The study also found that while adding a null hypothesis significantly improved performance, adding further experimental details provided diminishing returns, suggesting that the initial hypothesis is a critical anchor for AI reasoning.

Insights and Limitations

The benchmark revealed that model performance varies significantly by scientific domain. Bioactive materials research saw higher alignment scores, suggesting that current models may have more robust baseline knowledge in this area. In contrast, mechanical materials research proved more difficult, showing a wider range of scores and indicating that this field remains a significant challenge for AI. The researchers also noted that there is a potential trade-off between a model's ability to perform open-ended discovery and its ability to follow structured reasoning. Because the benchmark relies on recent, real-world publications, it provides a scalable way to continuously evaluate how AI systems handle the evolving frontier of scientific knowledge.
