ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure
Scientific discovery is a creative and uncertain process that requires more than recalling facts; it demands the ability to reason through new problems. Many existing benchmarks test how well AI models answer textbook questions or retrieve information, but they often fail to measure a model's capacity for genuine scientific discovery. This paper introduces ProjectionBench, a new framework for evaluating how well Large Language Models (LLMs) generate scientific hypotheses and predict research outcomes as the information they receive grows from minimal context to full experimental details.
A Progressive Approach to Discovery
The core of ProjectionBench is a "progressive disclosure" framework. Instead of asking a model to solve a problem all at once, the benchmark reveals information in stages. It begins by providing only the research topic and the research question. In later stages, the model is given additional context, such as the null hypothesis and the specific experimental procedures used in the study. Testing the model at each stage lets researchers assess two things: its "innovativeness" (the ability to make educated guesses with little information) and its "grounded reasoning" (the ability to draw accurate conclusions when given the full experimental details).
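The staged reveal described above can be illustrated with a minimal Python sketch. The `PaperRecord` fields, the three-stage split, and the prompt wording are all assumptions made for illustration; the source does not specify the benchmark's actual schema or prompts.

```python
from dataclasses import dataclass


@dataclass
class PaperRecord:
    # Hypothetical schema: field names are illustrative, not from the benchmark.
    topic: str
    research_question: str
    null_hypothesis: str
    procedures: str


# Fields newly revealed at each stage (assumed ordering). Stages are cumulative.
STAGES = [
    ("topic", "research_question"),
    ("null_hypothesis",),
    ("procedures",),
]


def build_prompts(paper: PaperRecord) -> list[str]:
    """Return one prompt per stage, each including everything revealed so far."""
    prompts = []
    revealed: list[str] = []
    for fields in STAGES:
        revealed.extend(fields)
        context = "\n".join(
            f"{name.replace('_', ' ').title()}: {getattr(paper, name)}"
            for name in revealed
        )
        # Placeholder instruction; the real benchmark prompt is not given.
        prompts.append(context + "\n\nPredict the main findings of this study.")
    return prompts
```

Each prompt would be sent to the model separately, so the score at stage 1 reflects guesses made from the topic and question alone, while the final stage reflects reasoning with all context available.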
Measuring Accuracy Through Atomic Claims
To grade the models, the researchers developed an automated method that breaks down complex scientific results into "atomic claims." These claims represent specific relationships between variables, such as how an experimental manipulation affects a measured outcome. By comparing these atomic claims against the ground truth from actual published papers, the system calculates a score based on precision and recall. This allows the benchmark to identify if a model is missing key findings or including "extraneous" claims that don't match the actual results. To ensure fairness, the researchers use an LLM-as-a-judge approach, which is calibrated to avoid bias toward any specific model's writing style.
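As a rough illustration of the scoring idea, the sketch below computes precision and recall over atomic claims and lists the extraneous and missed ones. The `Claim` structure, the exact-match comparison, and the F1 combination are assumptions: the source says only that the score is based on precision and recall, and that an LLM judge decides whether claims match. The `matches` parameter is where such a judge would plug in.

```python
from typing import Callable, NamedTuple


class Claim(NamedTuple):
    # Hypothetical representation of one atomic claim.
    manipulation: str
    outcome: str
    direction: str  # e.g. "increase", "decrease", "no effect"


def exact_match(a: Claim, b: Claim) -> bool:
    # Stand-in for the LLM judge, which would compare claims semantically.
    return a == b


def score_claims(
    predicted: list[Claim],
    truth: list[Claim],
    matches: Callable[[Claim, Claim], bool] = exact_match,
) -> dict:
    """Score predicted claims against ground-truth claims."""
    extraneous = [p for p in predicted if not any(matches(p, t) for t in truth)]
    missed = [t for t in truth if not any(matches(p, t) for p in predicted)]
    precision = (len(predicted) - len(extraneous)) / len(predicted) if predicted else 0.0
    recall = (len(truth) - len(missed)) / len(truth) if truth else 0.0
    # F1 is one common way to combine the two; the paper's formula may differ.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "extraneous": extraneous,
        "missed": missed,
    }
```

Low precision here signals extraneous claims that the published results do not support, while low recall signals key findings the model failed to predict.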
Performance Across Frontier Models
The researchers tested several state-of-the-art models, including GPT-5, GPT-5.4, Gemini 2.5 Pro, and Gemini 3.1 Pro Preview, using 45 recent papers on bioactive materials, mechanical materials, and nanomaterials. The results showed that newer models consistently outperformed their predecessors. Notably, GPT-5.4 demonstrated a strong ability to align with ground-truth conclusions even when provided with minimal context. The study also found that adding a null hypothesis significantly improved performance, whereas adding further experimental details yielded diminishing returns, suggesting that the initial hypothesis is a critical anchor for AI reasoning.
Insights and Limitations
The benchmark revealed that model performance varies significantly by scientific domain. Bioactive materials research saw higher alignment scores, suggesting that current models may have more robust baseline knowledge in this area. In contrast, mechanical materials research proved more difficult, showing a wider range of scores and indicating that this field remains a significant challenge for AI. The researchers also noted that there is a potential trade-off between a model's ability to perform open-ended discovery and its ability to follow structured reasoning. Because the benchmark relies on recent, real-world publications, it provides a scalable way to continuously evaluate how AI systems handle the evolving frontier of scientific knowledge.