From Raw Experience to Skill Consumption: A Systema...

Key Takeaways

  • Language agents increasingly improve by reusing skills: structured procedural artifacts distilled from past experience.
  • Domain-level and model-generated skills are especially promising.
  • Such skills offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting.
  • No comprehensive study has yet covered the full skill lifecycle, so the authors build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains.

Paper Abstract

Language agents increasingly improve by reusing skills: structured procedural artifacts distilled from past experience. In particular, domain-level and model-generated skills are especially promising. They offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding remains limited, with no comprehensive study spanning the full skill lifecycle (experience generation, skill extraction, and skill consumption) to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, with skill utility independent of model scale or baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility, which consistently improves skill quality across domains and substantially reduces negative transfer.

Language agents are increasingly using "skills"—structured, procedural instructions distilled from past experiences—to improve their performance without needing to be retrained. While these skills are promising for helping agents adapt to new tasks, there has been little systematic research into whether they actually work, why they sometimes fail, and how the entire process of creating and using them functions. This paper provides a comprehensive, utility-grounded study of the full skill lifecycle: generating experience, extracting skills from that experience, and consuming those skills to solve new tasks.

The Skill Lifecycle

The researchers break the process of agent improvement into three distinct stages. First, in experience generation, a target agent interacts with an environment to create a pool of successful and failed task trajectories. Second, in skill extraction, an extractor model processes these trajectories to distill reusable behavioral insights into structured procedural knowledge. Finally, in skill consumption, the target agent uses these extracted skills to perform new, unseen tasks. By testing this pipeline across five diverse domains—including software engineering, web search, and embodied planning—the study evaluates how well these skills translate into real-world performance gains.
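
The three stages described above can be sketched as a small pipeline. This is a minimal illustration and not the paper's implementation: the `Trajectory` type, the `run_lifecycle` function, and the toy agent and extractor are all hypothetical names invented here, and the "skills" are plain strings rather than the structured artifacts the study uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One recorded attempt at a task (hypothetical structure)."""
    task: str
    steps: List[str]
    success: bool


# Hypothetical interfaces: an agent maps (task, skills) to a trajectory,
# and an extractor maps a pool of trajectories to a list of skills.
Agent = Callable[[str, List[str]], Trajectory]
Extractor = Callable[[List[Trajectory]], List[str]]


def success_rate(trajectories: List[Trajectory]) -> float:
    return sum(t.success for t in trajectories) / len(trajectories)


def run_lifecycle(agent: Agent, extractor: Extractor,
                  train_tasks: List[str], test_tasks: List[str]) -> float:
    """Return the change in success rate on unseen tasks when skills are used."""
    # Stage 1: experience generation (the agent acts without any skills).
    experience = [agent(task, []) for task in train_tasks]
    # Stage 2: skill extraction from successful and failed trajectories.
    skills = extractor(experience)
    # Stage 3: skill consumption on new tasks, compared with a no-skill baseline.
    baseline = success_rate([agent(task, []) for task in test_tasks])
    with_skills = success_rate([agent(task, skills) for task in test_tasks])
    return with_skills - baseline


def toy_agent(task: str, skills: List[str]) -> Trajectory:
    # Toy rule: "easy" tasks always succeed; other tasks succeed only
    # when some skill mentions the task's category.
    category = task.split(":")[0]
    solved = category == "easy" or any(category in s for s in skills)
    return Trajectory(task=task, steps=[f"attempt {task}"], success=solved)


def toy_extractor(trajectories: List[Trajectory]) -> List[str]:
    # Toy rule: write one hint per task category that produced a failure.
    failed = {t.task.split(":")[0] for t in trajectories if not t.success}
    return sorted(f"for {c} tasks, check inputs first" for c in failed)


gain = run_lifecycle(toy_agent, toy_extractor,
                     train_tasks=["easy:a", "hard:b"],
                     test_tasks=["easy:c", "hard:d"])
print(gain)  # 0.5 in this toy setup
```

Returning the difference from a no-skill baseline, rather than the raw success rate, makes it possible for a skill to register as harmful as well as helpful.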

Key Findings on Skill Utility

The study reveals that while model-generated skills are beneficial on average, they are not a guaranteed solution. In 25% of the tested pairings between extractors and target agents, the use of skills actually led to "negative transfer," meaning the agent performed worse than it would have without the skills. Furthermore, the researchers found that a model’s ability to solve tasks does not predict its ability to extract useful skills. A model that is a strong task performer might be a poor extractor, and vice versa, suggesting that skill extraction is a unique capability that requires specific alignment between the extractor, the target agent, and the task domain.
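
To make the notion of negative transfer concrete, the sketch below counts the share of extractor and target pairings in which skills lowered performance. The function name `negative_transfer_rate`, the pairing labels, and every number in `example_deltas` are made up for illustration; they are not results from the paper.

```python
from typing import Dict, Tuple

# Maps (extractor, target) to the change in task success when the target
# uses the extractor's skills, relative to using no skills at all.
Deltas = Dict[Tuple[str, str], float]


def negative_transfer_rate(deltas: Deltas) -> float:
    """Fraction of pairings where using skills made the target worse."""
    if not deltas:
        raise ValueError("need at least one extractor/target pairing")
    harmful = sum(1 for delta in deltas.values() if delta < 0)
    return harmful / len(deltas)


# Illustrative toy numbers only (hypothetical models and gains).
example_deltas: Deltas = {
    ("extractor_a", "target_x"): 0.10,
    ("extractor_a", "target_y"): -0.05,
    ("extractor_b", "target_x"): 0.02,
    ("extractor_b", "target_y"): 0.08,
    ("extractor_c", "target_x"): -0.01,
}

rate = negative_transfer_rate(example_deltas)
print(rate)  # 0.4: two of the five toy pairings got worse
```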

Measuring Success

To better understand these dynamics, the researchers introduced two metrics: Extraction Efficacy (EE), which measures how reliably an extractor produces helpful skills for various targets, and Target Evolvability (TE), which measures how much a specific target agent benefits from skills extracted from its own experience. These metrics demonstrate that skill utility is highly dependent on the specific combination of the model doing the extracting and the model doing the consuming. Because of this, the researchers emphasize that choosing an extractor is a compatibility problem rather than simply selecting the most powerful model available.
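
The summary above does not give formulas for EE or TE, so the following is only one possible way to operationalize the two descriptions, stated as an assumption: EE is taken as the fraction of targets an extractor helps (a reliability measure), and TE as the mean gain a target sees across extractors working from that target's own experience. The function names and all numbers are hypothetical, and the paper's actual definitions may differ.

```python
from typing import Dict, Tuple

# (extractor, target) -> gain in task success when the extractor distills
# skills from the target's own experience and the target then uses them.
Deltas = Dict[Tuple[str, str], float]


def extraction_efficacy(deltas: Deltas, extractor: str) -> float:
    """Assumed EE: share of targets for which this extractor's skills help."""
    gains = [d for (e, _), d in deltas.items() if e == extractor]
    if not gains:
        raise ValueError(f"no pairings for extractor {extractor!r}")
    return sum(1 for g in gains if g > 0) / len(gains)


def target_evolvability(deltas: Deltas, target: str) -> float:
    """Assumed TE: mean gain this target obtains, averaged over extractors."""
    gains = [d for (_, t), d in deltas.items() if t == target]
    if not gains:
        raise ValueError(f"no pairings for target {target!r}")
    return sum(gains) / len(gains)


# Illustrative toy numbers only.
toy_deltas: Deltas = {
    ("ex_a", "tgt_x"): 0.5,
    ("ex_a", "tgt_y"): -0.25,
    ("ex_b", "tgt_x"): 0.25,
    ("ex_b", "tgt_y"): 0.25,
}

print(extraction_efficacy(toy_deltas, "ex_a"))   # 0.5
print(extraction_efficacy(toy_deltas, "ex_b"))   # 1.0
print(target_evolvability(toy_deltas, "tgt_x"))  # 0.375
print(target_evolvability(toy_deltas, "tgt_y"))  # 0.0
```

In this toy table, `ex_a` produces the single largest gain but helps only one of the two targets, while `ex_b` helps both. That is the kind of pairing-dependent pattern that makes extractor choice a compatibility question.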

Improving Future Skill Extraction

By dissecting the lifecycle stages, the researchers identified specific properties that characterize useful skills and analyzed how the composition of experience pools affects final quality. They translated these insights into a "meta-skill"—a concrete, actionable approach that guides the extraction process toward features tied to actual utility. This meta-skill consistently improves the quality of generated skills across different domains and significantly reduces the occurrence of negative transfer, moving the field toward a more principled and reliable discipline for agent development.
