Scientific machine learning papers often make specific computational claims, such as achieving a certain level of accuracy or successfully modeling physical systems. While coding agents can be prompted to replicate these findings, they often struggle to maintain progress, verify their own work, or distinguish between genuine evidence and copied material. This paper introduces "Paper-replication," a specialized workflow and coding-agent skill designed to ensure that scientific claims are rigorously validated through a persistent, evidence-based process rather than relying on the agent's final summary.
A New Standard for Replication
The core of this approach is to treat paper replication as a "target-level evidence contract." Instead of asking an agent to simply "replicate a paper," the workflow forces the agent to break the paper down into individual claims, called targets. For each target, the agent must create a record containing four things: its reconstruction of the paper's method, the code used to run the experiment, the resulting data, and a comparison against the original paper's claim. Because every target needs all four, the agent cannot claim success without providing a clear, traceable link between the original paper and the newly generated evidence.
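To make the idea of a per-target record concrete, here is a minimal Python sketch. The class name `TargetRecord`, its field names, and the helper `missing_evidence` are hypothetical illustrations; the summary does not describe the actual schema used by Paper-replication.

```python
from dataclasses import dataclass, fields


@dataclass
class TargetRecord:
    """One claim from the paper plus the evidence gathered for it.

    Hypothetical schema: the field names are assumptions for illustration.
    """
    target_id: str
    claim: str                       # the claim as stated in the paper
    method_reconstruction: str = ""  # the agent's reading of the paper's method
    code_path: str = ""              # script used to run the experiment
    result_path: str = ""            # data produced by that script
    comparison: str = ""             # how the result compares with the claim


def missing_evidence(record: TargetRecord) -> list[str]:
    """Return the names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# A freshly inventoried target has a claim but no evidence yet.
record = TargetRecord(target_id="T1", claim="Relative L2 error below 1e-3")
print(missing_evidence(record))
```

Under this sketch, a target would count as satisfied only once `missing_evidence` returns an empty list, which mirrors the requirement that success cannot be claimed without all four pieces of evidence.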
How the Workflow Functions
Paper-replication operates through a persistent workspace that acts as a central record-keeper. The agent follows a structured process: it first inventories the paper’s source materials, then populates a "reproduction matrix" to track each target. As the agent works, it must record its assumptions, implementation details, and the provenance of its results in specification files. Crucially, the system includes automated validation checks. These checks act as a "completion gate," ensuring that every target has been addressed, that the agent’s generated outputs are distinct from the original paper’s assets, and that all evidence is properly documented in a final report.
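The three conditions enforced by the completion gate can be sketched as a single check. This is an illustration under stated assumptions, not the workflow's actual validator: the function name `completion_gate` and its arguments are hypothetical, "distinct from the paper's assets" is approximated by comparing SHA-256 hashes of file contents, and "documented in the final report" is approximated by a plain substring search for the target ID.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest used to detect byte-identical copies."""
    return hashlib.sha256(data).hexdigest()


def completion_gate(targets, paper_asset_hashes, report_text):
    """Return a list of problems; an empty list means the gate passes.

    targets: dict mapping target ID -> bytes of the generated output,
             or None if the target has no output yet (assumed layout).
    paper_asset_hashes: set of SHA-256 hex digests of the paper's own assets.
    report_text: contents of the final report.
    """
    problems = []
    for target_id, output in targets.items():
        if output is None:
            # Condition 1: every target must have been addressed.
            problems.append(f"{target_id}: no generated output")
        elif sha256_hex(output) in paper_asset_hashes:
            # Condition 2: outputs must not be copies of the paper's assets.
            problems.append(f"{target_id}: output is a copy of a paper asset")
        if target_id not in report_text:
            # Condition 3: the final report must document the target.
            problems.append(f"{target_id}: not documented in the final report")
    return problems


paper_figure = b"original figure bytes"
problems = completion_gate(
    targets={"T1": b"new result", "T2": paper_figure, "T3": None},
    paper_asset_hashes={sha256_hex(paper_figure)},
    report_text="This report covers T1 and T2.",
)
for line in problems:
    print(line)
```

Hash comparison only catches byte-identical copies, and the substring test would treat "T1" as present in "T10", so a real validator would need stricter matching on both counts.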
Evaluating Performance
The authors tested this workflow across twelve independent runs involving four different scientific machine learning papers, covering topics like physics-informed neural networks and dynamical systems. Every run reached the completion gate, and all 158 recorded targets were matched with verified evidence. The study also shows that runs reaching the same end state can still differ substantially in how they decompose the paper into targets, how long they take to complete the work, and which rules they use to judge whether a result matches the original claim.
Key Takeaways
The study demonstrates that successful replication is more than just generating code; it is about creating a durable, verifiable record of scientific work. By shifting the focus from the agent's final message to the state of the workspace and the quality of the evidence, the Paper-replication workflow provides a more reliable way to assess whether a computational claim holds up. This approach helps address common failure modes in AI research, such as losing track of long-term goals or failing to provide sufficient justification for scientific conclusions.