SkillsBench Finds Self-Generated Agent Skills Add No Average Benefit
What the Hacker News thread pointed to
On February 16, 2026, Hacker News surfaced the paper SkillsBench under the headline "Study: Self-generated Agent Skills are useless." At capture time, the HN item showed a score of 217 with 102 comments. The underlying question is practical: when teams attach procedural skill packs to LLM agents, do those packs reliably improve execution, and can models write equally useful skills on their own?
Benchmark design
SkillsBench defines 86 tasks across 11 domains, each paired with curated skills and deterministic verifiers. Every task is evaluated in three settings: no skills, curated skills, and self-generated skills. The authors report 7 agent-model configurations and 7,308 trajectories, which makes the study more than a single demo. It is structured to isolate skill effects from model-only behavior and to measure pass-rate changes under repeatable checks.
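A three-setting design like this can be sketched as a small evaluation harness. The sketch below is illustrative only: the function names, the verifier, and the toy rollout outcomes are assumptions for demonstration, not SkillsBench's actual code or results.

```python
SETTINGS = ("no_skills", "curated_skills", "self_generated_skills")

def verify(task, output):
    """Deterministic verifier stub: a real one checks task-specific artifacts."""
    return output == f"expected:{task}"

def run_agent(task, setting):
    """Toy rollout whose outcomes exist only to exercise the harness;
    which settings 'succeed' here is arbitrary, not a claim about real models."""
    return f"expected:{task}" if setting == "curated_skills" else "wrong"

def pass_rates(tasks):
    """Pass rate per setting, so every condition is scored by identical checks."""
    return {s: sum(verify(t, run_agent(t, s)) for t in tasks) / len(tasks)
            for s in SETTINGS}

rates = pass_rates([f"task_{i}" for i in range(5)])
```

Running every task under all three settings with the same deterministic verifier is what lets the study attribute pass-rate changes to the skill pack rather than to model variance.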
Reported findings
- Curated skills raise average pass rate by +16.2 percentage points.
- Gains vary by domain, from +4.5pp in Software Engineering to +51.9pp in Healthcare.
- 16 out of 84 tasks show negative deltas, meaning skills can hurt in some cases.
- Self-generated skills provide no average benefit.
- Focused skills of 2-3 modules outperform broad, documentation-style skill bundles.
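The deltas above are differences in pass rate expressed in percentage points. A minimal sketch of the arithmetic, using hypothetical per-domain tallies rather than the paper's data:

```python
def delta_pp(baseline_passes, skilled_passes, n_trials):
    """Percentage-point change in pass rate from adding a skill pack."""
    return 100 * (skilled_passes - baseline_passes) / n_trials

# Hypothetical tallies (baseline passes, with-skills passes, trials) -- not the paper's numbers.
domains = {
    "software_eng": (40, 45, 100),
    "healthcare":   (20, 72, 100),
}
deltas = {d: delta_pp(*v) for d, v in domains.items()}
# A negative delta means the skill pack hurt on that domain.
```

Reporting deltas in percentage points (pp) rather than relative percentages keeps domains with very different baselines comparable.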
Why this matters for agent engineering
The paper suggests that procedural assets are now a first-class performance lever, not just an implementation detail. For production teams, this shifts effort from "long generic playbooks" toward narrow, testable, domain-specific skill modules. It also argues for verifier-first development: define objective checks, run no-skill versus curated-skill comparisons, and retire low-signal skill packs quickly. In short, scaling the base model is only part of the story; maintaining high-quality, validated skills is increasingly central to reliable agent outcomes.
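The verifier-first workflow described above can be reduced to a simple gate: keep a skill pack only if it beats the no-skill baseline by a meaningful margin. The threshold, pack names, and rates below are illustrative assumptions, not values from the paper.

```python
def keep_skill_pack(no_skill_rate, skilled_rate, min_gain_pp=2.0):
    """Retain a pack only if it beats the no-skill baseline by at least
    min_gain_pp percentage points (threshold is an illustrative choice)."""
    return (skilled_rate - no_skill_rate) * 100 >= min_gain_pp

# Hypothetical packs with (no-skill, with-skill) pass rates from a verifier run.
packs = {
    "narrow_db_migrations": (0.50, 0.66),  # focused module: clear gain, kept
    "generic_playbook":     (0.50, 0.51),  # broad bundle: low signal, retired
}
kept = [name for name, (base, skilled) in packs.items()
        if keep_skill_pack(base, skilled)]
```

Running this gate on every pack after each verifier sweep is one way to operationalize "retire low-signal skill packs quickly."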
Related Articles
OpenAI reports that, across more than one million ChatGPT conversations, the share of difficult interactions exceeding a human baseline increased roughly fourfold from September 2024 to January 2026. The company also shows large gains in case-interview and puzzle-style open tasks.
A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing at least 16.4% flawed test cases. The announcement reframes how coding-model benchmark scores should be interpreted in production decision-making.
Chinese AI lab DeepSeek plans to release its flagship V4 model this week—a 1-trillion-parameter native multimodal model built around Huawei Ascend chips that deliberately bypasses Nvidia and AMD.