
Paper Abstract

Terminal agents have demonstrated strong potential for autonomous command-line execution, yet their training remains constrained by the scarcity of high-quality and diverse execution trajectories. Existing approaches mitigate this bottleneck by synthesizing large-scale terminal task instances for trajectory sampling. However, they primarily focus on scaling the number of tasks while providing limited control over the diversity of execution trajectories that agents actually experience during training. In this paper, we present SkillSynth, an automated framework for terminal task synthesis built on a scenario-mediated skill graph. SkillSynth first constructs a large-scale skill graph, where scenarios serve as intermediate transition nodes that connect diverse command-line skills. It then samples paths from this graph as abstractions of real-world workflows, and uses a multi-agent harness to instantiate them into executable task instances. By grounding task synthesis in graph-sampled workflow paths, SkillSynth explicitly controls the diversity of minimal execution trajectories required to solve the synthesized tasks. Experiments on Terminal-Bench demonstrate the effectiveness of SkillSynth. Moreover, task instances synthesized by SkillSynth have been adopted to train Hy3 Preview, contributing to its enhanced agentic capabilities in terminal-based settings.

Toward Scalable Terminal Task Synthesis via Skill Graphs
Terminal agents, AI models that execute commands in a computer terminal, are becoming increasingly capable. Their training, however, is held back by a shortage of high-quality, diverse execution trajectories to learn from. Existing approaches respond by synthesizing more task instances, but they focus on scaling the task count and offer limited control over how varied the resulting execution trajectories actually are. This paper introduces SkillSynth, a framework that automates the creation of diverse terminal tasks by using a "skill graph" to map out how different command-line skills connect to one another.

Mapping Terminal Skills

The core of SkillSynth is a large-scale skill graph. The researchers treat terminal usage as an alternation of "scenarios" (the state of the system) and "skills" (the actions that move the system from one state to another). By collecting thousands of real-world skills from public repositories and defining their preconditions and postconditions, the framework builds a graph in which scenarios act as intermediate transition nodes that connect one skill to the next. This structure lets the system follow the logical flow of a multi-step task rather than generating unrelated commands.
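A minimal sketch of how such a graph could be represented is shown here. It is not the authors' implementation: the `Skill` class, the `build_skill_graph` and `next_skills` helpers, and every skill and scenario name are hypothetical, and the sketch assumes that two skills are linked when the postcondition of one matches the precondition of the other by exact label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    precondition: str   # scenario that must hold before the skill runs
    postcondition: str  # scenario that holds after the skill runs


def build_skill_graph(skills):
    """Index skills by the scenario they start from."""
    graph = defaultdict(list)
    for skill in skills:
        graph[skill.precondition].append(skill)
    return dict(graph)


def next_skills(graph, skill):
    """Skills reachable from `skill` through its postcondition scenario."""
    return graph.get(skill.postcondition, [])


# Hypothetical skills, invented for illustration only.
SKILLS = [
    Skill("clone_repo", "empty_workspace", "repo_checked_out"),
    Skill("install_deps", "repo_checked_out", "deps_installed"),
    Skill("run_tests", "deps_installed", "test_report_available"),
    Skill("grep_todos", "repo_checked_out", "todo_list_available"),
]

graph = build_skill_graph(SKILLS)
followers = [s.name for s in next_skills(graph, SKILLS[0])]
print(followers)
```

In this toy graph the scenario `repo_checked_out` is the transition node: it is the only thing connecting `clone_repo` to the two skills that can follow it.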

Generating Diverse Workflows

Once the graph is built, SkillSynth samples paths through it to create "blueprints" for new tasks. To keep the training data diverse, the system uses an inverse-frequency sampling method. This technique deliberately prioritizes skills and scenarios that have been used less often, which prevents the model from repeatedly practicing the same types of tasks. The sampled paths are then passed to a multi-agent harness, a system of AI agents that work together to turn the blueprints into fully executable, verified terminal tasks, complete with instructions and testing scripts.
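The sampling idea can be illustrated with a short sketch. The summary does not give the exact weighting SkillSynth uses, so the formula here, weight = 1 / (1 + times already sampled), is an assumption, as are the `sample_path` function and the toy graph.

```python
import random
from collections import Counter


def sample_path(graph, start, max_len, counts, rng):
    """Walk the graph, favoring skills that earlier paths used less often."""
    path, scenario = [], start
    for _ in range(max_len):
        candidates = graph.get(scenario, [])
        if not candidates:
            break  # no skill applies in this scenario, so the path ends
        # Assumed inverse-frequency weight: 1 / (1 + times already sampled).
        weights = [1.0 / (1 + counts[name]) for name, _ in candidates]
        name, scenario = rng.choices(candidates, weights=weights, k=1)[0]
        counts[name] += 1
        path.append(name)
    return path


# Hypothetical graph: scenario -> list of (skill, resulting scenario).
GRAPH = {
    "repo_checked_out": [
        ("install_deps", "deps_installed"),
        ("grep_todos", "todo_list_available"),
    ],
    "deps_installed": [("run_tests", "test_report_available")],
}

rng = random.Random(0)  # fixed seed so repeated runs give the same paths
counts = Counter()
paths = [
    sample_path(GRAPH, "repo_checked_out", 3, counts, rng) for _ in range(200)
]
print(counts)
```

Because a skill's weight shrinks each time it is picked, the usage counts of competing skills are pulled toward each other over many samples, which is the balancing effect the inverse-frequency method is meant to provide.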

Performance and Results

In a single automated run, the framework produced 3,560 verified, high-quality task instances with a 95.7% success rate. Experiments show that these tasks are more challenging than those created by previous methods, requiring more steps to solve. When used to train models like Qwen3 and Hy3 Preview, the data generated by SkillSynth led to improved performance on standard benchmarks, which suggests that the diversity of the training data matters as much as its quantity.

Considerations for Future Scaling

The researchers note that the quality of the initial task generation is critical. If the first attempt to create a task leaves a corrupted filesystem or a broken environment, the task can be difficult to repair even after multiple attempts. The framework also depends on the quality of the skill graph. As the community contributes more skills, the graph can keep expanding, which would allow ongoing, automated synthesis of more complex and varied terminal tasks.
