Automated reproducibility assessments in the social and behavioral sciences using large language models
Reproducibility is a cornerstone of scientific progress, yet verifying published findings is often a slow, manual, and resource-intensive process. Researchers typically have to reconstruct complex empirical workflows to see if they can arrive at the same results as the original study. This paper explores whether large language models (LLMs) can automate this task, providing a scalable way to audit empirical research in the social and behavioral sciences.
How the Automated Workflow Works
The researchers developed an "agentic" LLM workflow designed to act as an independent analyst. For each study in their sample, the LLM is provided with the original dataset, the specific statistical claim made by the authors, and the full text of the research paper.
Without access to the original authors' code, the LLM must independently write and execute its own statistical code in an isolated sandbox. The model repeats this analysis five times for each study so that the consistency of its results can be checked. The researchers then compare the LLM’s findings, specifically the standardized effect sizes and qualitative conclusions, against both the original published results and the findings from a large-scale human reanalysis project.
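To make the repeated-run step concrete, here is a minimal Python sketch of how five runs for one study might be summarized. The summary does not say how the researchers combine runs, so the aggregation rule (median effect size, majority-vote conclusion), the function name summarize_runs, and the conclusion labels are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def summarize_runs(runs):
    """Summarize repeated analyses of one study.

    `runs` is a list of (effect_size, conclusion) pairs, one per run.
    effect_size is None when a run produced no valid estimate.
    The aggregation rule here is an assumption, not the paper's method.
    """
    # Keep only runs that produced a usable effect size.
    valid = [(d, c) for d, c in runs if d is not None]
    if not valid:
        return None  # the study would be excluded from the comparison

    effect_sizes = [d for d, _ in valid]
    # Majority vote over the qualitative conclusions of the valid runs.
    majority, _count = Counter(c for _, c in valid).most_common(1)[0]

    return {
        "median_d": median(effect_sizes),
        "spread": max(effect_sizes) - min(effect_sizes),
        "conclusion": majority,
        "n_valid": len(valid),
    }


# Hypothetical example: five runs, one of which failed.
example_runs = [
    (0.30, "supports"),
    (0.32, "supports"),
    (None, "failed"),
    (0.28, "supports"),
    (0.50, "contradicts"),
]
summary = summarize_runs(example_runs)
```

A wide spread across runs, or a split vote on the conclusion, would be a natural signal that a study needs closer human inspection.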
Key Findings
The study analyzed 76 published papers from economics, political science, and psychology. After excluding studies where the LLM could not produce a valid estimate, the researchers found that their automated pipeline was highly effective at reaching the same qualitative conclusions as the original authors, matching them in 96% of cases.
For effect sizes (expressed as Cohen’s d), the LLM matched the original findings within a strict tolerance range in 41% of the studies. For comparison, human reanalysts in the same dataset matched the original effect sizes in 34% of cases. While the LLM’s continuous effect-size estimates did not always align perfectly with the original reports, the researchers noted that this level of variation is similar to what is observed when humans conduct these same assessments.
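For readers unfamiliar with the metric, the sketch below computes Cohen's d for two independent groups using the standard pooled-standard-deviation formula, then checks whether a reproduced value falls within a tolerance of the original. The tolerance of 0.05 is a placeholder, since the summary does not state the threshold the researchers used, and the function names are hypothetical. Many studies report other statistics that must first be converted to d, which this sketch does not cover.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def matches_within_tolerance(d_original, d_reproduced, tolerance=0.05):
    """True if the reproduced effect size is close to the original.

    The default tolerance is an assumed placeholder, not the paper's value.
    """
    return abs(d_original - d_reproduced) <= tolerance


# Hypothetical data: treatment and control outcomes.
treatment = [2, 4, 6]
control = [1, 3, 5]
d = cohens_d(treatment, control)  # (4 - 3) / 2 = 0.5
```

Under this kind of rule, a reproduced estimate can agree with the original in direction and significance, and so count as a qualitative match, while still falling outside the numeric tolerance. That is one way the 96% and 41% figures can coexist.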
Practical Implications and Limitations
The authors suggest that LLMs could serve as a valuable "first-pass" screen for journals, helping to flag studies that require closer human inspection. This could make systematic auditing of the literature more feasible without replacing the need for expert judgment. By automating the initial check, journals could reduce the burden on reviewers while increasing transparency.
However, the researchers highlight several important caveats. First, automated screening should be used as a tool for quality control rather than a gatekeeping mechanism, to avoid penalizing authors for legitimate analytical variations. Second, the study notes that reproducibility is not the same as replicability; while an LLM can check if a result can be recovered from existing data, it cannot determine if that finding would hold true in a new experiment. Finally, the researchers acknowledge that their results may be influenced by factors like training data contamination and the inherent difficulty of standardizing diverse statistical methods into a single metric.