OpenThoughts-Agent: Data Recipes for Agentic Models

Key Takeaways

  • Agentic language models dramatically expand the applications of AI, yet little is publicly known about how to curate training data for broadly capable agents.
  • Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks.
  • The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models.
  • More than 100 controlled ablation experiments investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity.

Paper Abstract

Agentic language models dramatically expand the applications of AI, yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open-data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at this http URL to support future open research on agentic model training.

The OpenThoughts-Agent (OT-Agent) project aims to solve the lack of transparency in how training data is curated for agentic AI models. While many models can now perform complex tasks like using a computer or resolving software issues, the specific "recipes" for the data used to train them remain largely hidden. This project provides a fully open data curation pipeline, conducting over 100 experiments to determine which data strategies lead to the most capable agents.

A Systematic Data Pipeline

To build an effective agent, researchers must curate high-quality pairs of tasks and agent trajectories. The team systematically tested different stages of this process, such as how tasks are generated, how they are mixed, and how to filter the resulting data. They discovered that the choice of "teacher" model—the AI that generates the training examples—is critical; surprisingly, the most powerful model is not always the best teacher. Additionally, they found that filtering out short, low-quality interactions in favor of longer, multi-turn trajectories significantly improves the final model's performance.
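The trajectory-length filter described above can be illustrated with a minimal sketch. The `Trajectory` class, the `filter_by_turns` helper, and the `min_turns=5` threshold are all hypothetical names and values chosen for illustration; the summary does not state the project's actual filtering criteria or data format.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record: one task attempt and its sequence of turns."""
    task_id: str
    turns: list[str]  # agent actions and environment observations, in order


def filter_by_turns(trajectories, min_turns=5):
    """Keep only trajectories with at least `min_turns` turns.

    The threshold is an assumed value, not one reported by the project.
    """
    return [t for t in trajectories if len(t.turns) >= min_turns]


trajectories = [
    Trajectory("a", ["step"] * 2),  # short interaction, dropped
    Trajectory("b", ["step"] * 8),  # long multi-turn trajectory, kept
    Trajectory("c", ["step"] * 5),  # exactly at the threshold, kept
]
kept = filter_by_turns(trajectories, min_turns=5)
kept_ids = [t.task_id for t in kept]
print(kept_ids)
```

In practice a length threshold like this would probably be tuned by ablation, since the cutoff that separates low-quality interactions from useful ones depends on the task mix.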

Scaling and Diversity

A major challenge in training agents is ensuring they can handle a wide variety of tasks rather than just one. The researchers found that simply adding more data often leads to diminishing returns if the tasks are too similar. To overcome this, they used synthetic augmentation to expand the diversity of their task descriptions. By combining this with a balanced mix of high-quality sources, they created a 100,000-example dataset that shows strong scaling properties, meaning the model continues to get better as the dataset grows.
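One simple way to picture a "balanced mix" of sources is round-robin sampling, sketched below. This is an illustrative assumption, not the project's mixing procedure: the `balanced_mix` function, the `source` field, and the source names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_mix(tasks, total, seed=0):
    """Draw up to `total` tasks, cycling over sources so none dominates."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for task in tasks:
        by_source[task["source"]].append(task)
    for pool in by_source.values():
        rng.shuffle(pool)

    mixed = []
    # Round-robin over sources until the budget is met or every pool is empty.
    while len(mixed) < total and any(by_source.values()):
        for source in sorted(by_source):
            if by_source[source] and len(mixed) < total:
                mixed.append(by_source[source].pop())
    return mixed


# Hypothetical, deliberately imbalanced task pool.
tasks = (
    [{"source": "swe", "id": i} for i in range(10)]
    + [{"source": "terminal", "id": i} for i in range(3)]
    + [{"source": "web", "id": i} for i in range(2)]
)
mixed = balanced_mix(tasks, total=6)
counts = Counter(t["source"] for t in mixed)
print(counts)
```

With a budget of six, each of the three sources contributes two tasks even though one source holds most of the pool. Once the smaller pools run out, a larger budget falls back to whatever sources still have tasks.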

Performance Gains

The team fine-tuned the Qwen3-32B model on their curated dataset and achieved an average accuracy of 44.8% across seven agentic benchmarks. This is a 3.9 percentage point improvement over the strongest existing open-data agentic model, Nemotron-Terminal-32B (40.9%). The benchmarks span tasks ranging from software engineering to terminal-based system administration, indicating that the data pipeline is effective for creating broadly capable agents.
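The reported gain is in percentage points, which is easy to confuse with a relative improvement. The short calculation below uses only the two averages quoted above (44.8% and 40.9%); the variable names are illustrative.

```python
ot_agent_avg = 44.8   # average accuracy (%) reported for the fine-tuned Qwen3-32B
baseline_avg = 40.9   # average accuracy (%) reported for Nemotron-Terminal-32B

# Absolute gain, in percentage points.
point_gain = round(ot_agent_avg - baseline_avg, 1)

# Relative gain, as a percentage of the baseline's accuracy.
relative_gain = round(100 * (ot_agent_avg - baseline_avg) / baseline_avg, 1)

print(f"{point_gain} percentage points, {relative_gain}% relative")
```

The 3.9 point gap therefore corresponds to a relative improvement of roughly 9.5% over the baseline's average accuracy.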

Future Research

Beyond supervised fine-tuning, the project also explored data curation for reinforcement learning (RL). By documenting the challenges of existing RL datasets and introducing a new, more systematic approach, the team showed that combining SFT with RL leads to even better performance in smaller 8B models. To encourage further innovation, the researchers have publicly released their training sets, the full data pipeline, and their experimental models, providing the community with the tools to continue investigating how to best train agentic AI.
