Language agents increasingly rely on "skills" (structured, procedural instructions distilled from past experience) to improve their performance without retraining. Although skills are a promising way to help agents adapt to new tasks, there has been little systematic research into whether they actually work, why they sometimes fail, and how the process of creating and using them functions end to end. This paper presents a comprehensive, utility-grounded study of the full skill lifecycle: generating experience, extracting skills from that experience, and consuming those skills to solve new tasks.
The Skill Lifecycle
The researchers break the process of agent improvement into three distinct stages. First, in experience generation, a target agent interacts with an environment to create a pool of successful and failed task trajectories. Second, in skill extraction, an extractor model processes these trajectories to distill reusable behavioral insights into structured procedural knowledge. Finally, in skill consumption, the target agent uses these extracted skills to perform new, unseen tasks. By testing this pipeline across five diverse domains—including software engineering, web search, and embodied planning—the study evaluates how well these skills translate into real-world performance gains.
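The three stages can be sketched as a small pipeline. This is only an illustration under stated assumptions: the summary does not describe the paper's code, so every name here (`Trajectory`, `Skill`, `generate_experience`, `extract_skills`, `consume_skills`, and the toy agent and extractor) is hypothetical, and the toy models are deliberately trivial stand-ins for real language models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    success: bool


@dataclass
class Skill:
    name: str
    procedure: str


# Hypothetical interfaces: an agent attempts a task given some skills,
# an extractor turns a pool of trajectories into skills.
Agent = Callable[[str, List[Skill]], Trajectory]
Extractor = Callable[[List[Trajectory]], List[Skill]]


def generate_experience(agent: Agent, tasks: List[str]) -> List[Trajectory]:
    """Stage 1: the target agent attempts tasks without any skills."""
    return [agent(task, []) for task in tasks]


def extract_skills(extractor: Extractor, pool: List[Trajectory]) -> List[Skill]:
    """Stage 2: an extractor distills the experience pool into skills."""
    return extractor(pool)


def consume_skills(agent: Agent, tasks: List[str], skills: List[Skill]) -> float:
    """Stage 3: the target agent attempts unseen tasks; returns success rate."""
    if not tasks:
        return 0.0
    results = [agent(task, skills) for task in tasks]
    return sum(t.success for t in results) / len(results)


# Toy stand-ins (assumptions for illustration only).
def toy_agent(task: str, skills: List[Skill]) -> Trajectory:
    has_helpful_skill = any(s.name == "check-inputs" for s in skills)
    return Trajectory(task, success=task.startswith("easy") or has_helpful_skill)


def toy_extractor(pool: List[Trajectory]) -> List[Skill]:
    # Emit one skill only if the pool contains at least one failure to learn from.
    if any(not t.success for t in pool):
        return [Skill("check-inputs", "Validate inputs before acting.")]
    return []


pool = generate_experience(toy_agent, ["easy-1", "hard-1"])
skills = extract_skills(toy_extractor, pool)
baseline = consume_skills(toy_agent, ["easy-2", "hard-2"], [])
with_skills = consume_skills(toy_agent, ["easy-2", "hard-2"], skills)
```

Comparing `with_skills` against `baseline` on the same held-out tasks is what makes the evaluation utility-grounded: a skill counts as useful only if it changes downstream task success.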
Key Findings on Skill Utility
The study finds that model-generated skills help on average but are not a guaranteed improvement. In 25% of the tested pairings between extractors and target agents, using skills led to "negative transfer": the agent performed worse with the skills than without them. The researchers also found that a model's ability to solve tasks does not predict its ability to extract useful skills. A strong task performer can be a poor extractor, and vice versa, which suggests that skill extraction is a distinct capability whose payoff depends on the fit between the extractor, the target agent, and the task domain.
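To make the notion of negative transfer concrete, the following sketch counts the extractor-target pairings in which skills lowered the score relative to the no-skill baseline. The function name, the data layout, and all numbers are assumptions made up for illustration; they are not results from the paper.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (extractor, target)


def negative_transfer_rate(
    baseline: Dict[str, float],
    with_skills: Dict[Pair, float],
) -> Tuple[float, List[Pair]]:
    """Return the fraction of pairings that score below the target's
    no-skill baseline, together with the offending pairings."""
    negative = [
        pair
        for pair, score in with_skills.items()
        if score < baseline[pair[1]]
    ]
    return len(negative) / len(with_skills), negative


# Toy scores (hypothetical): success rate per target without skills...
baseline = {"target_a": 0.50, "target_b": 0.40}
# ...and with skills from each extractor.
with_skills = {
    ("extractor_x", "target_a"): 0.58,
    ("extractor_x", "target_b"): 0.44,
    ("extractor_y", "target_a"): 0.45,  # worse than baseline
    ("extractor_y", "target_b"): 0.41,
}

rate, offenders = negative_transfer_rate(baseline, with_skills)
```

In this toy grid one of four pairings regresses, so the rate is 0.25. The comparison is always against the same target's own baseline, which is why a pairing can show negative transfer even when its absolute score looks respectable.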
Measuring Success
To better understand these dynamics, the researchers introduced two metrics: Extraction Efficacy (EE), which measures how reliably an extractor produces helpful skills for various targets, and Target Evolvability (TE), which measures how much a specific target agent benefits from skills extracted from its own experience. These metrics demonstrate that skill utility is highly dependent on the specific combination of the model doing the extracting and the model doing the consuming. Because of this, the researchers emphasize that choosing an extractor is a compatibility problem rather than simply selecting the most powerful model available.
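A rough sketch of how two such metrics could be computed from a grid of per-pairing gains follows. The summary does not give the paper's formulas, so the definitions below are assumed stand-ins: EE is approximated as the fraction of targets an extractor helps, and TE as the mean gain a target sees across extractors. The function names and the numbers are likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple

# gains[(extractor, target)] = score with skills minus score without skills
Gains = Dict[Tuple[str, str], float]


def extraction_efficacy(gains: Gains) -> Dict[str, float]:
    """Assumed proxy for EE: share of targets each extractor improves."""
    helped = defaultdict(list)
    for (extractor, _target), gain in gains.items():
        helped[extractor].append(gain > 0)
    return {e: sum(flags) / len(flags) for e, flags in helped.items()}


def target_evolvability(gains: Gains) -> Dict[str, float]:
    """Assumed proxy for TE: mean gain each target receives across extractors."""
    received = defaultdict(list)
    for (_extractor, target), gain in gains.items():
        received[target].append(gain)
    return {t: mean(values) for t, values in received.items()}


# Toy gains (made up for illustration).
gains = {
    ("extractor_x", "target_a"): 0.10,
    ("extractor_x", "target_b"): 0.04,
    ("extractor_y", "target_a"): -0.05,
    ("extractor_y", "target_b"): 0.02,
}

ee = extraction_efficacy(gains)
te = target_evolvability(gains)
```

Aggregating the same grid along its two axes shows why extractor choice is a compatibility problem: `extractor_y` helps `target_b` but hurts `target_a`, so neither a per-extractor nor a per-target number alone describes a pairing.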
Improving Future Skill Extraction
By dissecting the lifecycle stages, the researchers identified specific properties that characterize useful skills and analyzed how the composition of the experience pool affects skill quality. They translated these insights into a "meta-skill": a concrete, actionable set of guidance that steers the extraction process toward features tied to actual utility. This meta-skill consistently improves the quality of generated skills across domains and significantly reduces the occurrence of negative transfer, a step toward making skill extraction a more principled and reliable part of agent development.