D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery

Scientific research is increasingly powered by language models that can write code to analyze data and test hypotheses. However, these AI agents often struggle because they lack "verifiable environments": controlled digital spaces where they can run experiments and receive reliable feedback on whether their scientific results are correct. D3-Gym addresses this gap by providing the first automatically constructed dataset of 565 verifiable scientific tasks, allowing researchers to train and evaluate AI agents in realistic, data-driven settings.

Building a Scientific Sandbox

Constructing environments for scientific discovery is difficult because, unlike standard software engineering, scientific code often lacks pre-existing tests, and the "correctness" of an output is highly domain-specific. The D3-Gym pipeline solves this by sourcing tasks from hundreds of real scientific repositories. It applies a rigorous filtering process to ensure data integrity and then uses a "planning-then-coding" approach with large language models to generate custom evaluation scripts. These scripts act as automated judges, determining whether a model's scientific analysis is accurate according to task-specific metrics and acceptance criteria.

Validating Scientific Accuracy

To ensure these automatically generated evaluation scripts are trustworthy, the researchers compared them against a set of human-annotated "gold standard" scripts created by Ph.D. students. The results showed 87.5% agreement on pass/fail verdicts, indicating that the automated system is scientifically sound. By separating evaluation into a high-level planning phase and a technical coding phase, the system captures complex domain-specific logic that simpler methods often miss.

Boosting AI Performance

The researchers used D3-Gym to train models from the Qwen3 family with a technique called rejection-sampling fine-tuning.
In this process, models learn from their own successful attempts at solving scientific tasks. The training led to consistent performance gains across all model sizes. Notably, the Qwen3-32B model saw its success rate on the ScienceAgentBench benchmark increase by 7.8 absolute points, significantly narrowing the gap between open-weight models and strong proprietary alternatives.

A Foundation for Future Discovery

D3-Gym covers four major scientific disciplines: bioinformatics, computational chemistry, geographic information science, and psychology and cognitive neuroscience. By providing a scalable, verifiable foundation, the project enables more robust training for AI agents. While the current tasks remain challenging (even top-tier models solve only about a third of them), the dataset provides a clear path forward for developing AI that can reliably assist in real-world scientific workflows.
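The rejection-sampling loop described above (sample many candidate solutions per task, keep only those the task's evaluation script accepts, then fine-tune on the survivors) can be sketched as follows. This is a minimal illustration, not the actual D3-Gym pipeline: the generator and verifier here are toy stand-ins for a policy model and an auto-generated evaluation script.

```python
import random

def generate_candidates(task, n_samples, rng):
    """Stand-in for sampling n candidate solutions from the policy model."""
    return [f"{task}-attempt-{rng.randint(0, 9)}" for _ in range(n_samples)]

def verify(task, candidate):
    """Stand-in for the task's evaluation script: returns True iff the
    candidate meets the acceptance criteria (toy condition here)."""
    return candidate.endswith(("0", "1", "2"))

def build_rft_dataset(tasks, n_samples=8, seed=0):
    """Keep only (task, solution) pairs the verifier accepts; the model
    is then fine-tuned on this filtered dataset of its own successes."""
    rng = random.Random(seed)
    kept = []
    for task in tasks:
        for cand in generate_candidates(task, n_samples, rng):
            if verify(task, cand):
                kept.append((task, cand))
    return kept

dataset = build_rft_dataset(["taskA", "taskB"])
assert all(verify(t, c) for t, c in dataset)  # every kept pair passed
```

The key design choice is that the verifier, not the model, decides which trajectories enter the training set, so training quality depends directly on how trustworthy the evaluation scripts are.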