"Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining" explores whether "skill libraries" (collections of reusable, named procedures) can be created automatically by observing how users interact with computer interfaces. Today these libraries are typically written by hand, which is a time-consuming bottleneck. By mining patterns from existing interaction data, the researchers hope to make computer-using agents easier to inspect, debug, and improve.
How the Pipeline Works
The researchers developed a three-stage process to turn raw interaction data into structured skills. First, they use a "boundary detector" to cut long sequences of actions into smaller segments based on sudden changes in behavior. Second, they group these segments into clusters using a mathematical approach that compares the statistical distribution of actions within each segment. Finally, they use these clusters to train an AI policy, testing whether the model can learn to compose these skills to complete tasks more effectively.
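The first two stages can be illustrated with a toy sketch. This is not the researchers' implementation: the summary does not name the boundary detector or the distribution-comparison method, so the sliding-window comparison, the total variation distance, the greedy clustering, and every function name and threshold below are assumptions chosen only to show the idea. The third stage (policy training) is not sketched.

```python
from collections import Counter


def histogram(actions, vocab):
    """Normalized frequency of each action type in a list of actions."""
    counts = Counter(actions)
    return [counts[a] / len(actions) for a in vocab]


def total_variation(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


def detect_boundaries(actions, vocab, window=3, threshold=0.8):
    """Stage 1 (assumed form): cut where the action mix changes sharply.

    Compares the `window` actions before index i with the `window` actions
    starting at i, and records i when the two histograms differ enough.
    """
    boundaries = []
    for i in range(window, len(actions) - window + 1):
        before = histogram(actions[i - window:i], vocab)
        after = histogram(actions[i:i + window], vocab)
        if total_variation(before, after) >= threshold:
            # Skip detections that fall too close to the previous boundary.
            if not boundaries or i - boundaries[-1] >= window:
                boundaries.append(i)
    return boundaries


def split(actions, boundaries):
    """Slice a trajectory into segments at the given boundary indices."""
    cuts = [0, *boundaries, len(actions)]
    return [actions[a:b] for a, b in zip(cuts, cuts[1:])]


def cluster(segments, vocab, max_dist=0.3):
    """Stage 2 (assumed form): greedy clustering on segment histograms.

    A segment joins the first cluster whose seed histogram is within
    `max_dist`; otherwise it starts a new cluster.
    """
    seeds, labels = [], []
    for seg in segments:
        h = histogram(seg, vocab)
        for k, seed in enumerate(seeds):
            if total_variation(h, seed) <= max_dist:
                labels.append(k)
                break
        else:
            seeds.append(h)
            labels.append(len(seeds) - 1)
    return labels


# Hypothetical trajectory: typing, then clicking, then typing again.
vocab = ["type", "click"]
trajectory = ["type"] * 4 + ["click"] * 4 + ["type"] * 4
boundaries = detect_boundaries(trajectory, vocab)
labels = cluster(split(trajectory, boundaries), vocab)
```

In this toy run the two typing segments land in the same cluster and the clicking segment in another, which is the kind of grouping the clustering stage aims for. A threshold set too low would produce extra cuts inside a single activity.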
Readable Structure vs. Useful Skills
The study found that the pipeline is quite successful at identifying coherent, human-readable skills within the source data. For example, the system successfully grouped actions into recognizable categories like "document editing" or "data transfer," with several clusters achieving high purity against ground-truth labels. However, the researchers discovered a critical gap: just because a skill is easy for a human to read or categorize does not mean it is useful for an AI agent. The mined skills did not consistently help the agent perform better on new, unseen tasks.
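Cluster purity against ground-truth labels can be computed as the share of a cluster's segments that carry its most common label. The sketch below uses that standard definition; the summary does not say exactly how the study computed purity, and the cluster IDs and label names here are made-up illustrations.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Return {cluster_id: fraction of members sharing the majority label}."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid].append(label)
    return {
        cid: Counter(labels).most_common(1)[0][1] / len(labels)
        for cid, labels in members.items()
    }


# Hypothetical assignments for seven segments.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
true_labels = ["document editing", "document editing", "data transfer",
               "data transfer", "data transfer", "data transfer", "data transfer"]
purity = cluster_purity(cluster_ids, true_labels)
```

Here cluster 1 is perfectly pure while cluster 0 mixes two labels. High purity only says a cluster is coherent to a human labeler; it says nothing about whether an agent benefits from the skill, which is the gap the study reports.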
The Limits of Current Methods
The results highlight significant challenges in moving from "readable" to "transferable" skills. The current boundary detection method often over-splits actions, identifying transitions that aren't actually meaningful, and the system struggles to adapt to new environments where the scale of user actions differs. Furthermore, the researchers found that their trained policies often performed worse than simple statistical baselines, such as choosing the most frequently occurring skill.
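The most-frequent-skill baseline mentioned above is simple enough to sketch in a few lines. The function names, skill names, and accuracy metric below are illustrative assumptions, not the evaluation protocol from the study.

```python
from collections import Counter


def most_frequent_skill(train_skills):
    """Build a predictor that ignores its input and returns the commonest training skill."""
    top = Counter(train_skills).most_common(1)[0][0]
    return lambda _state=None: top


def accuracy(predict, states, targets):
    """Fraction of states for which the predicted skill matches the target skill."""
    hits = sum(predict(s) == t for s, t in zip(states, targets))
    return hits / len(targets)


# Hypothetical skill sequences.
train_skills = ["edit", "edit", "transfer"]
baseline = most_frequent_skill(train_skills)
score = accuracy(baseline, ["s1", "s2", "s3", "s4"],
                 ["edit", "transfer", "edit", "edit"])
```

Because this baseline never looks at the state, a learned policy that scores below it is getting no usable signal from the mined skills, which is why the comparison is a useful sanity check.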
A Diagnostic Perspective
The authors present this work as a diagnostic study rather than a finished solution. They conclude that while trajectory mining can successfully expose the underlying structure of how people use computers, the current components (specifically the way boundaries are detected, how segments are represented, and how the reward model is structured) are not yet sufficient to create reliable, cross-domain agents. The findings serve as a roadmap for future research, clarifying which parts of the process are functional and which remain unsolved.