Coding-agents can replicate scientific machine lear...

Key Takeaways

  • Scientific machine learning papers typically make computational claims, e.g., that the relative mean square error is less than 5% or that the 95% predictive credible interval covers the test data.
  • A coding agent can be prompted to replicate those claims from paper materials alone, but the prompt does not by itself reliably preserve progress or check whether generated evidence supports the paper's claims.
  • We introduce Paper-replication, a workflow that makes each selected paper claim a target with recorded evidence, and implement it as a coding-agent skill.
  • We evaluate Paper-replication on twelve independent runs across four scientific machine learning papers.

Paper Abstract

Scientific machine learning papers typically make computational claims, e.g., that the relative mean square error is less than 5% or that the 95% predictive credible interval covers the test data. A coding agent can be prompted to replicate those claims from paper materials alone, but the prompt does not by itself reliably preserve progress or check whether generated evidence supports the paper's claims. We introduce Paper-replication, a workflow that makes each selected paper claim a target with recorded evidence, and implement it as a coding-agent skill. The workflow makes the agent record those targets, reconstruct the paper's method, run computational experiments, link generated outputs to provenance and comparisons with the paper's claims, record where matched evidence appears in the replication report, and pass validation checks before completion. We evaluate Paper-replication on twelve independent runs across four scientific machine learning papers. All twelve workspaces pass the completion gate, and all 158 recorded targets are matched with report coverage. Even in this completed workspace state, repeated runs differ in how papers are divided into targets, in numerical fidelity to the source papers, in elapsed replication time, in the number of intermediate executions replaced before final evidence is accepted, and in the rules used to accept evidence. Paper-replication makes completion depend on workspace evidence and validation checks rather than on the agent's final message.

Scientific machine learning papers often make specific computational claims, such as achieving a certain level of accuracy or successfully modeling physical systems. While coding agents can be prompted to replicate these findings, they often struggle to maintain progress, verify their own work, or distinguish between genuine evidence and copied material. This paper introduces "Paper-replication," a specialized workflow and coding-agent skill designed to ensure that scientific claims are rigorously validated through a persistent, evidence-based process rather than relying on the agent's final summary.
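
To make the two example claims from the abstract concrete, here is a minimal sketch of how such claims could be checked numerically. The function names, the toy data, and the exact definition of relative mean square error (squared error divided by the squared magnitude of the reference) are assumptions for illustration; individual papers may define these quantities differently, and this is not the paper's own code.

```python
from typing import Sequence


def relative_mse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Relative mean square error: sum((pred - truth)^2) / sum(truth^2)."""
    num = sum((p - t) ** 2 for p, t in zip(pred, truth))
    den = sum(t ** 2 for t in truth)
    if den == 0.0:
        raise ValueError("relative error is undefined for an all-zero reference")
    return num / den


def interval_coverage(lower: Sequence[float], upper: Sequence[float],
                      truth: Sequence[float]) -> float:
    """Fraction of test points that fall inside [lower, upper]."""
    inside = sum(1 for lo, hi, t in zip(lower, upper, truth) if lo <= t <= hi)
    return inside / len(truth)


# Toy data (hypothetical, not taken from any paper).
truth = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.9]
lower = [p - 0.3 for p in pred]   # stand-in for a 95% credible interval
upper = [p + 0.3 for p in pred]

rel_mse = relative_mse(pred, truth)
coverage = interval_coverage(lower, upper, truth)

# Claim 1: relative mean square error is below 5%.
claim_error_ok = rel_mse < 0.05
# Claim 2 (one possible reading): every test point lies inside the interval.
claim_interval_ok = coverage == 1.0

print(f"relative MSE = {rel_mse:.4f}, claim holds: {claim_error_ok}")
print(f"interval coverage = {coverage:.2f}, claim holds: {claim_interval_ok}")
```

Each claim reduces to a number and a comparison, which is what makes it usable as a replication target.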

A New Standard for Replication

The core of this approach is to treat paper replication as a "target-level evidence contract." Instead of asking an agent to simply "replicate a paper," the workflow makes the agent break the paper down into specific, individual claims, called targets. For each selected target, the agent must keep a record that includes its reconstruction of the paper’s method, the code used to run the experiment, the generated outputs, and a comparison against the original paper’s claims. Because completion depends on this record, the agent cannot claim success without a clear, traceable link between the original paper and the newly generated evidence.
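
One way to picture a per-target record is as a small data structure that cannot be marked as matched until every piece of evidence is present. The sketch below is an assumption about what such a record might contain, based only on the description above; the class name `TargetRecord`, its fields, the file paths, and the status values are hypothetical and are not the skill's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TargetRecord:
    """Hypothetical record for one paper claim (a 'target')."""
    target_id: str
    paper_claim: str                    # the claim as stated in the paper
    method_notes: str = ""              # reconstruction of the paper's method
    code_path: Optional[str] = None     # script that produced the evidence
    output_paths: List[str] = field(default_factory=list)  # generated outputs
    comparison: Optional[str] = None    # how the outputs compare with the claim
    status: str = "pending"

    def missing_fields(self) -> List[str]:
        """Names of evidence fields that are still empty."""
        missing = []
        if not self.method_notes:
            missing.append("method_notes")
        if not self.code_path:
            missing.append("code_path")
        if not self.output_paths:
            missing.append("output_paths")
        if not self.comparison:
            missing.append("comparison")
        return missing

    def mark_matched(self) -> None:
        """Refuse to mark the target as matched while evidence is missing."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"{self.target_id}: missing evidence: {missing}")
        self.status = "matched"


# Hypothetical example values.
record = TargetRecord(
    target_id="T1",
    paper_claim="relative mean square error is less than 5%",
    method_notes="Reconstructed the training setup from the methods section.",
    code_path="runs/t1/train.py",
    output_paths=["runs/t1/metrics.json"],
    comparison="Replicated error is below the 5% bound stated in the paper.",
)
record.mark_matched()
print(record.status)
```

The design choice worth noting is that the status change is guarded by the record itself, so a summary message alone cannot turn a target into a matched one.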

How the Workflow Functions

Paper-replication operates through a persistent workspace that acts as a central record-keeper. The agent follows a structured process: it first inventories the paper’s source materials, then populates a "reproduction matrix" to track each target. As the agent works, it must record its assumptions, implementation details, and the provenance of its results in specification files. Crucially, the system includes automated validation checks. These checks act as a "completion gate," ensuring that every target has been addressed, that the agent’s generated outputs are distinct from the original paper’s assets, and that all evidence is properly documented in a final report.
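
The three checks named above (every target addressed, generated outputs distinct from the paper's assets, evidence documented in the report) can be sketched as a single gate function. This is an illustrative assumption, not the workflow's real validator: the in-memory data layout, the function name `completion_gate`, and the use of SHA-256 hashes to detect copied assets are all choices made for this example.

```python
import hashlib
from typing import Dict, List


def sha256(data: bytes) -> str:
    """Hex digest used to compare file contents."""
    return hashlib.sha256(data).hexdigest()


def completion_gate(targets: List[dict],
                    paper_assets: Dict[str, bytes],
                    report_text: str) -> List[str]:
    """Return a list of problems; an empty list means the gate passes."""
    problems = []
    asset_hashes = {sha256(blob) for blob in paper_assets.values()}
    for t in targets:
        tid = t["id"]
        # Check 1: the target has been addressed.
        if t["status"] != "matched":
            problems.append(f"{tid}: status is {t['status']!r}, not 'matched'")
        if not t["outputs"]:
            problems.append(f"{tid}: no generated outputs recorded")
        # Check 2: outputs are not byte-for-byte copies of paper assets.
        for name, data in t["outputs"].items():
            if sha256(data) in asset_hashes:
                problems.append(f"{tid}: output {name!r} is identical to a paper asset")
        # Check 3: the report mentions the target (a crude substring check).
        if tid not in report_text:
            problems.append(f"{tid}: not referenced in the report")
    return problems


# Hypothetical workspace state.
paper_assets = {"fig2.png": b"original figure bytes"}
targets = [
    {"id": "T1", "status": "matched",
     "outputs": {"fig2_repro.png": b"regenerated figure bytes"}},
    {"id": "T2", "status": "matched",
     "outputs": {"fig2_copy.png": b"original figure bytes"}},
]
report = "T1: the regenerated figure agrees with the paper."

problems = completion_gate(targets, paper_assets, report)
gate_passed = not problems
for p in problems:
    print(p)
print("gate passed:", gate_passed)
```

Here the second target fails twice: its output is a copy of a paper asset, and the report never mentions it. A real validator would need a stricter report check than substring matching, since an identifier such as `T1` would also match inside `T10`.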

Evaluating Performance

The researchers tested this workflow across twelve independent runs involving four different scientific machine learning papers, covering topics like physics-informed neural networks and dynamical systems. All twelve workspaces passed the completion gate, and all 158 recorded targets were matched with report coverage. The study highlights that even when every run reaches this completed state, repeated runs still differ in how they decompose the paper into targets, in numerical fidelity to the source papers, in the time taken to complete the work, in the number of intermediate executions replaced before final evidence is accepted, and in the specific rules used to judge whether a result matches the original claim.
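
To see why the choice of acceptance rule matters, consider one replicated number judged under three different rules. The rules, tolerances, and values below are hypothetical and chosen only to illustrate the point; they are not taken from the paper or from any of its runs.

```python
def within_absolute(replicated: float, reported: float, tol: float) -> bool:
    """Accept if the absolute difference is at most tol."""
    return abs(replicated - reported) <= tol


def within_relative(replicated: float, reported: float, rel_tol: float) -> bool:
    """Accept if the difference is at most rel_tol times the reported value."""
    return abs(replicated - reported) <= rel_tol * abs(reported)


def satisfies_bound(replicated: float, bound: float) -> bool:
    """Accept if the replicated value satisfies the claimed upper bound."""
    return replicated < bound


reported = 0.042    # hypothetical value stated in a paper
replicated = 0.047  # hypothetical value obtained by a replication run

verdicts = {
    "absolute tolerance 0.01": within_absolute(replicated, reported, 0.01),
    "relative tolerance 10%": within_relative(replicated, reported, 0.10),
    "claimed bound < 0.05": satisfies_bound(replicated, 0.05),
}
for rule, accepted in verdicts.items():
    print(f"{rule}: {'accept' in ('accept',) and ('accept' if accepted else 'reject')}")
```

The same result is accepted under the absolute tolerance and the claimed bound but rejected under the relative tolerance, so two runs that both report "matched" may not mean the same thing unless the rule is recorded alongside the evidence.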

Key Takeaways

The study demonstrates that successful replication is more than just generating code; it is about creating a durable, verifiable record of scientific work. By shifting the focus from the agent's final message to the state of the workspace and the quality of the evidence, the Paper-replication workflow provides a more reliable way to assess whether a computational claim holds up. This approach helps address common failure modes of coding agents on long tasks, such as losing track of progress or declaring completion without evidence that supports the paper's claims.
