Back to AI Research

AI Research

SkillOpt: Executive Strategy for Self-Evolving Agen... | AI Research

Key Takeaways

  • SkillOpt: Executive Strategy for Self-Evolving Agent Skills The researchers behind SkillOpt address a fundamental challenge in AI agent development: how to r...
  • We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible.
  • A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment.
  • On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code.
  • SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Paper AbstractExpand

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.

SkillOpt: Executive Strategy for Self-Evolving Agent Skills
The researchers behind SkillOpt address a fundamental challenge in AI agent development: how to reliably improve an agent’s procedural performance without needing to retrain its underlying model weights. While current methods often rely on manual prompt engineering or uncontrolled self-revision, SkillOpt introduces a systematic, "deep-learning-style" optimizer for agent skills. By treating a skill document as an external, trainable state, the system uses a separate optimizer model to iteratively refine procedural instructions, ensuring that only changes that demonstrably improve performance on held-out data are accepted.

A Controlled Approach to Skill Evolution

SkillOpt functions like a traditional machine learning optimizer but operates entirely in text space. The process begins with a frozen target model executing tasks using a current skill document. The system then analyzes the resulting successes and failures, grouping them into reflection minibatches. An optimizer model proposes specific, bounded edits—such as adding, deleting, or replacing instructions—to the skill document. To maintain stability, these edits are subject to a "textual learning rate" and a validation gate: a candidate skill is only accepted if it improves performance on a held-out validation set. This prevents the agent from adopting harmful or overfitting changes.

Stability and Negative Feedback

A key innovation in SkillOpt is its ability to learn from its own mistakes. The system maintains a "rejected-edit buffer," which records failed attempts and the reasons for their poor performance. This buffer provides negative feedback to the optimizer, ensuring that the model does not repeat ineffective strategies in future iterations. Additionally, the system employs an "epoch-wise slow/meta update," which acts similarly to a momentum term in deep learning. This allows the system to capture long-term, stable improvements across training epochs while keeping the final, deployed skill artifact compact and easy to audit.

Proven Performance and Portability

The researchers evaluated SkillOpt across six benchmarks, seven target models, and three different execution environments, including direct chat, Codex, and Claude Code. In all 52 evaluated scenarios, SkillOpt either outperformed or tied with existing methods, including human-written skills, one-shot prompting, and other automated evolution techniques. Notably, the optimized skills are highly portable; a skill trained in one environment or on one model scale often retains its effectiveness when transferred to different models or execution harnesses. This allows developers to optimize a skill once and deploy it across various agentic systems without additional training.

Practical Implications

The final output of the SkillOpt process is a compact, human-readable file (typically 300 to 2,000 tokens) that serves as a persistent procedural memory for the agent. Because the optimization happens offline and the resulting artifact is just text, there is zero additional inference-time cost when the skill is deployed. This makes SkillOpt a practical, harness-agnostic tool for domain adaptation, enabling agents to improve their tool-use, formatting, and reasoning capabilities through a rigorous, data-driven process that does not require modifying the agent's core model weights.

Comments (0)

No comments yet

Be the first to share your thoughts!