Toward Scalable Terminal Task Synthesis via Skill Graphs
Terminal agents—AI models capable of executing commands in a computer terminal—are becoming increasingly powerful. However, their training is often held back by a lack of high-quality, diverse practice tasks. While researchers have tried to create more tasks by scaling up the number of instances, these methods often produce redundant data that fails to teach agents how to handle a wide variety of real-world workflows. This paper introduces SkillSynth, a framework that automates the creation of diverse terminal tasks by using a "skill graph" to map out how different command-line actions connect to one another.
Mapping Terminal Skills
The core of SkillSynth is a large-scale skill graph. The researchers treat terminal usage as a sequence of "scenarios" (the state of the system) and "skills" (the actions taken to move from one state to another). By collecting thousands of real-world skills from public repositories and defining their preconditions and postconditions, the framework builds a map where scenarios act as nodes and skills act as the paths between them. This structure allows the system to understand the logical flow of complex tasks, rather than just generating random commands.
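The scenario/skill structure described above can be sketched as a small directed graph, where each skill's precondition names the scenario it can run in and its postcondition names the scenario it produces. This is a minimal illustration, not the paper's implementation; the class and field names (Skill, SkillGraph, precondition, postcondition) and the example skills are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    """A terminal action annotated with the scenarios it connects."""
    command: str
    precondition: str   # scenario (system state) required before the skill runs
    postcondition: str  # scenario (system state) produced after the skill runs

class SkillGraph:
    """Scenarios are nodes; skills are the directed edges between them."""
    def __init__(self):
        # scenario -> list of skills whose precondition is that scenario
        self.edges = {}

    def add_skill(self, skill):
        self.edges.setdefault(skill.precondition, []).append(skill)

    def successors(self, scenario):
        """Skills applicable from the given scenario (empty if none)."""
        return self.edges.get(scenario, [])

# Example: two skills chain into a coherent workflow because the first
# skill's postcondition matches the second skill's precondition.
g = SkillGraph()
g.add_skill(Skill("git clone <repo>", "empty_dir", "repo_checked_out"))
g.add_skill(Skill("pip install -e .", "repo_checked_out", "deps_installed"))

workflow = [s.command
            for s in g.successors("empty_dir") + g.successors("repo_checked_out")]
```

Chaining edges this way is what lets the system follow the logical flow of a task instead of emitting unrelated commands.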
Generating Diverse Workflows
Once the graph is built, SkillSynth samples paths through it to create "blueprints" for new tasks. To ensure the training data is as diverse as possible, the system uses an inverse-frequency sampling method. This technique intentionally prioritizes less-frequently used skills and scenarios, preventing the model from repeatedly practicing the same types of tasks. These sampled paths are then passed to a multi-agent harness—a system of AI agents that work together to turn these blueprints into fully executable, verified terminal tasks, complete with instructions and testing scripts.
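One way to realize the inverse-frequency idea is to weight each candidate skill by the reciprocal of how often it has already been sampled, so rarely practiced skills are picked more often. The sketch below assumes a simple 1/(1 + count) weighting; the function name and the weighting formula are illustrative assumptions, not details from the paper.

```python
import random
from collections import Counter

def sample_skill(skills, usage_counts, rng=random):
    """Pick a skill with probability inversely proportional to its past usage.

    usage_counts is updated in place, so a skill becomes less likely
    each time it is drawn, steadily pushing sampling toward rare skills.
    """
    weights = [1.0 / (1 + usage_counts[s]) for s in skills]
    choice = rng.choices(skills, weights=weights, k=1)[0]
    usage_counts[choice] += 1
    return choice

# Example: "tar" has been practiced often, so rarer skills dominate new samples.
usage = Counter({"tar": 50, "awk": 2, "systemctl": 0})
draws = Counter(sample_skill(list(usage), usage) for _ in range(100))
```

Repeatedly applying this per-step choice while walking the graph yields paths that favor under-represented scenarios, which is the diversity property the blueprints are meant to capture.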
Performance and Results
The framework is highly efficient, producing 3,560 verified, high-quality task instances in a single automated run with a 95.7% success rate. Experiments show that these tasks are significantly more challenging than those created by previous methods, requiring more steps to solve. When used to train models such as Qwen3 and Hy3 Preview, the data generated by SkillSynth improved performance on standard benchmarks, suggesting that the diversity of the training data matters as much as its quantity.
Considerations for Future Scaling
While SkillSynth is highly effective, the researchers noted that the quality of the initial task generation is critical: if the first attempt leaves behind a corrupted filesystem or a broken environment, later repair passes often cannot recover it. The framework also depends on the quality of the skill graph itself; as the community contributes more skills, the graph will continue to expand, enabling the ongoing, automated synthesis of even more complex and varied terminal tasks.