Automating SKILL.md Generation for Computer-Using A... | AI Research

Key Takeaways

  • Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies.
  • We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations.
  • The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels.
  • However, readability does not imply transfer.
Paper Abstract

Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining explores whether "skill libraries" (collections of reusable, named procedures) can be created automatically from records of how users interact with computer interfaces. Today these libraries are typically written by hand, which is a time-consuming bottleneck. By mining patterns from existing interaction data, the researchers hope to make computer-using agents easier to inspect, debug, and improve.

How the Pipeline Works

The researchers developed a three-stage pipeline to turn raw interaction data into structured skills. First, a "boundary detector" cuts long GUI trajectories into shorter segments at points where behavior changes suddenly. Second, the segments are grouped into clusters of candidate skills by comparing the statistical distribution of actions within each segment, a representation that ignores the order in which the actions occur. Finally, the cluster annotations are used to train a skill-aware policy, testing whether the model can learn to compose these skills to complete tasks more effectively.
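The first two stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the action vocabulary, the window size, the thresholds, the total-variation distance used to flag a "sudden change," and the greedy clustering rule are all hypothetical stand-ins chosen for brevity. The one property it tries to mirror from the summary is that a segment is described only by the distribution of its actions, so their order is discarded. The third stage, policy training, is not sketched here.

```python
from collections import Counter

# Hypothetical action vocabulary; the paper's actual action space is not given here.
ACTIONS = ["click", "type", "scroll", "copy", "paste"]

def histogram(actions):
    """Normalized action frequencies. Order is discarded; unknown actions are ignored."""
    counts = Counter(actions)
    total = max(len(actions), 1)  # avoid division by zero on an empty segment
    return [counts[a] / total for a in ACTIONS]

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def segment(trajectory, window=3, threshold=0.9):
    """Cut where the action mix in the windows before and after a point differs sharply."""
    boundaries = [0]
    for i in range(window, len(trajectory) - window + 1):
        left = histogram(trajectory[i - window:i])
        right = histogram(trajectory[i:i + window])
        far_enough = i - boundaries[-1] >= window  # assumed minimum segment length
        if far_enough and tv_distance(left, right) >= threshold:
            boundaries.append(i)
    boundaries.append(len(trajectory))
    return [trajectory[a:b] for a, b in zip(boundaries, boundaries[1:])]

def cluster(segments, radius=0.3):
    """Greedy clustering: join the first cluster whose representative is within `radius`."""
    representatives, labels = [], []
    for seg in segments:
        h = histogram(seg)
        for k, rep in enumerate(representatives):
            if tv_distance(h, rep) <= radius:
                labels.append(k)
                break
        else:
            representatives.append(h)
            labels.append(len(representatives) - 1)
    return labels

# Toy trajectory, invented for illustration only.
toy = ["type"] * 4 + ["scroll"] * 4 + ["type"] * 4
toy_segments = segment(toy)
print([len(s) for s in toy_segments])  # [4, 4, 4]
print(cluster(toy_segments))           # [0, 1, 0]
```

Because the histogram ignores order, a segment that copies and then pastes is indistinguishable from one that pastes and then copies. That is the kind of information an orderless representation gives up.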

Readable Structure vs. Useful Skills

The study found that the pipeline identifies coherent, human-readable skills within the source data. For example, it grouped actions into recognizable categories such as "document editing" and "data transfer," and five of the eight clusters reached at least 0.95 purity against the InteraSkill Workflows (IW) ground-truth labels. However, the researchers found a critical gap: a skill that is easy for a human to read or categorize is not necessarily useful to an agent. Training with GRPO raised IW skill-step accuracy only from 18.5% to 20.5% and left BrowseComp+ performance essentially unchanged, so the mined skills did not reliably help on new, unseen tasks.
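For readers unfamiliar with the metric, cluster purity is commonly computed as the fraction of a cluster's members that share its most frequent ground-truth label. The sketch below uses that common definition, which may differ in detail from the paper's, and the cluster assignments and labels are invented toy data.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels):
    """Per-cluster purity: share of members carrying the cluster's majority label."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, true_labels):
        members[cluster_id].append(label)
    return {
        cluster_id: Counter(labels).most_common(1)[0][1] / len(labels)
        for cluster_id, labels in members.items()
    }

# Toy assignments, invented for illustration only.
toy_clusters = [0, 0, 0, 0, 1, 1, 1, 1]
toy_labels = ["document editing"] * 4 + ["data transfer"] * 3 + ["document editing"]

purity = cluster_purity(toy_clusters, toy_labels)
print(purity)  # {0: 1.0, 1: 0.75}

# Count clusters that clear a purity threshold such as 0.95.
high_purity = sum(1 for p in purity.values() if p >= 0.95)
print(high_purity)  # 1
```

High purity only says that a cluster lines up with one human label. It says nothing about whether conditioning a policy on that cluster improves its decisions.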

The Limits of Current Methods

The results highlight significant challenges in moving from "readable" to "transferable" skills. The current boundary detector often over-splits trajectories, flagging transitions that are not meaningful, and the system struggles to adapt to new environments where the scale of user actions differs. Furthermore, the trained policies underperformed trivial frequency priors, such as always choosing the most frequently occurring skill, on key source-domain metrics.
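To make the baseline comparison concrete, here is a minimal sketch of the simplest kind of frequency prior: always predict the skill seen most often in training. The paper's exact baselines may be defined differently, and the skill names and data below are invented for illustration.

```python
from collections import Counter

def fit_frequency_prior(train_skills):
    """Return the single most frequent skill label in the training data."""
    return Counter(train_skills).most_common(1)[0][0]

def accuracy(predictions, targets):
    """Fraction of steps where the predicted skill matches the target skill."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

# Toy skill labels, invented for illustration only.
train_skills = ["edit", "edit", "edit", "transfer", "search"]
test_skills = ["edit", "transfer", "edit", "edit"]

majority_skill = fit_frequency_prior(train_skills)
baseline_predictions = [majority_skill] * len(test_skills)
baseline_accuracy = accuracy(baseline_predictions, test_skills)
print(majority_skill, baseline_accuracy)  # edit 0.75
```

A baseline like this uses no information about the current screen or task, so a learned policy that scores below it has not extracted useful signal from the skill annotations.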

A Diagnostic Perspective

The authors present this work as a diagnostic study rather than a finished solution. They conclude that while trajectory mining can successfully expose the underlying structure of how people use computers, the current components—specifically the way boundaries are detected, how segments are represented, and how the reward model is structured—are not yet sufficient to create reliable, cross-domain agents. The findings serve as a roadmap for future research, clarifying which parts of the process are functional and which remain unsolved.
