SkillGenBench is a new benchmark for evaluating how well AI models turn raw information, such as software code repositories or long technical documents, into "skills." As AI agents become more advanced, they are increasingly built from reusable procedural artifacts (skills) rather than from simple prompts alone. The paper argues that current benchmarks focus on whether an agent can use a skill, not on whether it can create a correct, executable skill in the first place. SkillGenBench provides a controlled, reproducible environment for testing these "skill generation pipelines" as independent components.
How the Benchmark Works
The benchmark treats the skill generator as the primary object of study. It provides the generator with raw source materials and asks it to produce a standardized skill artifact. This artifact is then tested in a fixed, containerized environment to see if it actually works. The benchmark covers two main ways of generating skills:
Task-conditioned generation: The model creates a skill specifically to solve a task it has already been shown.
Task-agnostic generation: The model must distill a library of reusable skills from raw data before it even knows what tasks it will be asked to perform later.
The benchmark also tests two types of source material: "repository-grounded" instances, where procedures are hidden across code and configuration files, and "document-grounded" instances, where procedures must be extracted from long-form text.
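The two generation modes and two source types described above can be pictured as a small data model. The following is a minimal sketch, not the benchmark's actual interface: every name in it (GenerationMode, SourceKind, Instance, generate_skill, toy_generator) is hypothetical and chosen only for illustration. The one point it tries to capture is that a task-conditioned generator sees the task, while a task-agnostic generator does not.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class GenerationMode(Enum):
    TASK_CONDITIONED = "task_conditioned"  # generator is shown the task
    TASK_AGNOSTIC = "task_agnostic"        # generator sees only the source


class SourceKind(Enum):
    REPOSITORY = "repository"  # procedures spread across code and config
    DOCUMENT = "document"      # procedures embedded in long-form text


@dataclass(frozen=True)
class Instance:
    """Hypothetical benchmark instance (field names are assumptions)."""
    source_kind: SourceKind
    source_text: str  # stand-in for the raw repository or document
    task: str         # downstream task used later for evaluation


# A generator maps (source, optional task) to a skill artifact.
# A plain string stands in for the standardized artifact here.
Generator = Callable[[str, Optional[str]], str]


def generate_skill(instance: Instance, mode: GenerationMode,
                   generator: Generator) -> str:
    """Call the generator, hiding the task in task-agnostic mode."""
    visible_task = (
        instance.task if mode is GenerationMode.TASK_CONDITIONED else None
    )
    return generator(instance.source_text, visible_task)


def toy_generator(source: str, task: Optional[str]) -> str:
    """Placeholder generator; a real one would call a model."""
    target = task if task is not None else "unknown future tasks"
    return f"skill from {len(source)} chars of source for {target}"
```

Under these assumptions, the same instance yields different generator inputs depending on the mode, which is what lets the benchmark compare the two settings on identical source material.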
Evaluating Performance
To ensure fairness and reproducibility, SkillGenBench follows a strict evaluation protocol. Deterministic, execution-based checks determine whether the generated skill performs correctly. Where there is no single "right" answer, the benchmark uses artifact-based evaluation instead, comparing the skill's output against a reference with methods such as semantic similarity or LLM-based judging. The benchmark includes 187 tasks across various domains, all validated to ensure they are challenging enough to require proper procedural extraction.
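The split between execution-based and artifact-based evaluation can be sketched as a single scoring function. This is an illustration under stated assumptions, not the paper's protocol: the function name evaluate, the check parameter, and the 0.8 threshold are all hypothetical, and difflib's character-level ratio is swapped in as a crude stand-in for the semantic similarity or LLM-based judging the summary mentions.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def evaluate(output: str, reference: str,
             check: Optional[Callable[[str], bool]] = None,
             threshold: float = 0.8) -> bool:
    """Return True if the skill's output passes.

    If a deterministic check is supplied, it decides the result alone.
    Otherwise the output is compared with a reference artifact.
    """
    if check is not None:
        # Execution-based path: a deterministic pass/fail predicate.
        return check(output)
    # Artifact-based path: lexical similarity as a stand-in (assumption)
    # for semantic similarity or an LLM judge.
    score = SequenceMatcher(None, output, reference).ratio()
    return score >= threshold
```

A task with one correct answer would pass a check such as `lambda out: out.strip() == "42"`, while an open-ended task would omit the check and rely on the reference comparison.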
Key Findings
Experiments conducted across several different AI models revealed significant performance gaps. The researchers found that:
Code is harder than documents: Models consistently struggled more with repository-grounded tasks (10.8%–14.4% success) than with document-grounded tasks (21.4%–25.0%). This is likely because repository tasks require the model to infer implicit workflows from scattered code and scripts.
Skills aren't always helpful: The study found that automatically generated skills are not universally beneficial. If a skill is poorly generated, it can introduce incorrect assumptions or interface inconsistencies that actually make the agent perform worse than if it had no skill at all.
Method sensitivity: The effectiveness of a skill-generation pipeline depends heavily on the interaction between the specific generation method used and the underlying AI model. No single method was a "silver bullet" across all scenarios.
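The finding that a generated skill can hurt performance implies comparing each pipeline against a no-skill baseline. The sketch below shows one simple way to express that comparison; the function name skill_delta and the per-task pass/fail lists are assumptions for illustration, and the sample values in the usage note are invented, not results from the paper.

```python
from typing import Sequence


def skill_delta(with_skill: Sequence[bool],
                without_skill: Sequence[bool]) -> float:
    """Success rate with the skill minus success rate without it.

    Both sequences hold per-task pass/fail results for the same tasks.
    A negative value means the generated skill made the agent worse.
    """
    if not with_skill or len(with_skill) != len(without_skill):
        raise ValueError("need two non-empty result lists of equal length")
    rate_with = sum(with_skill) / len(with_skill)
    rate_without = sum(without_skill) / len(without_skill)
    return rate_with - rate_without
```

For example, with made-up results `[True, False, False, False]` using the skill and `[True, True, False, False]` without it, the delta is negative, which is the case the second finding warns about.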