The OpenThoughts-Agent (OT-Agent) project aims to solve the lack of transparency in how training data is curated for agentic AI models. While many models can now perform complex tasks like using a computer or resolving software issues, the specific "recipes" for the data used to train them remain largely hidden. This project provides a fully open data curation pipeline, conducting over 100 experiments to determine which data strategies lead to the most capable agents.
A Systematic Data Pipeline
To build an effective agent, researchers must curate high-quality pairs of tasks and agent trajectories. The team systematically tested different stages of this process, such as how tasks are generated, how they are mixed, and how to filter the resulting data. They discovered that the choice of "teacher" model—the AI that generates the training examples—is critical; surprisingly, the most powerful model is not always the best teacher. Additionally, they found that filtering out short, low-quality interactions in favor of longer, multi-turn trajectories significantly improves the final model's performance.
Scaling and Diversity
A major challenge in training agents is ensuring they can handle a wide variety of tasks rather than just one. The researchers found that simply adding more data often leads to diminishing returns if the tasks are too similar. To overcome this, they used synthetic augmentation to expand the diversity of their task descriptions. By combining this with a balanced mix of high-quality sources, they created a 100,000-example dataset that shows strong scaling properties, meaning the model continues to get better as the dataset grows.
Performance Gains
The team fine-tuned the Qwen3-32B model using their curated dataset and achieved an average accuracy of 44.8% across seven different agentic benchmarks. This represents a 3.9 percentage point improvement over the strongest existing open-data agentic model, Nemotron-Terminal-32B. Their model demonstrated superior performance on tasks ranging from software engineering to terminal-based system administration, proving that their data pipeline is highly effective for creating broadly capable agents.
Future Research
Beyond supervised fine-tuning, the project also explored data curation for reinforcement learning (RL). By documenting the challenges of existing RL datasets and introducing a new, more systematic approach, the team showed that combining SFT with RL leads to even better performance in smaller 8B models. To encourage further innovation, the researchers have publicly released their training sets, the full data pipeline, and their experimental models, providing the community with the tools to continue investigating how to best train agentic AI.
Comments (0)
to join the discussion
No comments yet
Be the first to share your thoughts!