SkillsBench Finds Self-Generated Agent Skills Add No Average Benefit


LLM · Feb 17, 2026 · By Insights AI (HN) · 1 min read

What the Hacker News thread pointed to

On February 16, 2026, Hacker News surfaced the paper SkillsBench under the headline "Study: Self-generated Agent Skills are useless." At capture time, the HN item showed a score of 217 with 102 comments. The underlying question is practical: when teams attach procedural skill packs to LLM agents, do those packs reliably improve execution, and can models write equally useful skills on their own?

Benchmark design

SkillsBench defines 86 tasks across 11 domains, each paired with curated skills and deterministic verifiers. Every task is evaluated in three settings: no skills, curated skills, and self-generated skills. The authors report 7 agent-model configurations and 7,308 trajectories, which makes the study more than a single demo. It is structured to isolate skill effects from model-only behavior and to measure pass-rate changes under repeatable checks.
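The three-setting comparison can be sketched as a small evaluation loop. This is a minimal illustration, not the paper's actual harness: `Task`, `run_agent`, and the trial count are all assumptions; a real setup would replace `run_agent` with an LLM agent rollout.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    name: str
    domain: str
    # Deterministic verifier: agent output -> pass/fail (assumed interface).
    verify: Callable[[str], bool]

def run_agent(task: Task, skills: Optional[list]) -> str:
    """Stand-in for an agent rollout; a real harness would call an LLM agent."""
    prompt = task.name if not skills else task.name + "\n" + "\n".join(skills)
    return prompt  # placeholder output so the sketch is runnable

def pass_rate(tasks: list, skills: Optional[list], trials: int = 3) -> float:
    """Fraction of trajectories the verifier accepts, across repeated trials."""
    results = [t.verify(run_agent(t, skills)) for t in tasks for _ in range(trials)]
    return sum(results) / len(results)

def evaluate(tasks, curated, self_generated):
    """Score the same task set under the paper's three settings."""
    return {
        "no_skills": pass_rate(tasks, None),
        "curated": pass_rate(tasks, curated),
        "self_generated": pass_rate(tasks, self_generated),
    }
```

Because the verifier is deterministic, re-running the same trajectory set yields the same pass rate, which is what lets skill effects be isolated from model-only behavior.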

Reported findings

  • Curated skills raise the average pass rate by +16.2 percentage points (pp).
  • Gains vary by domain, from +4.5pp in Software Engineering to +51.9pp in Healthcare.
  • 16 out of 84 tasks show negative deltas, meaning skills can hurt in some cases.
  • Self-generated skills provide no average benefit.
  • Focused skills of 2-3 modules outperform broad, documentation-style skill bundles.
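The percentage-point arithmetic behind these deltas is simple to reproduce. The per-domain rates below are made up, chosen only so the resulting deltas match the reported +4.5pp and +51.9pp figures; the paper's underlying pass rates are not given here.

```python
def pp_delta(with_skills: float, without: float) -> float:
    """Percentage-point change in pass rate when skills are attached."""
    return round((with_skills - without) * 100, 1)

# domain -> (no-skill pass rate, curated-skill pass rate); illustrative values only
domain_rates = {
    "software_engineering": (0.600, 0.645),
    "healthcare": (0.200, 0.719),
}

deltas = {d: pp_delta(c, n) for d, (n, c) in domain_rates.items()}

# Negative deltas mark domains (or tasks) where attached skills hurt.
regressions = [d for d, delta in deltas.items() if delta < 0]
```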

Why this matters for agent engineering

The paper suggests that procedural assets are now a first-class performance lever, not just an implementation detail. For production teams, this shifts effort from "long generic playbooks" toward narrow, testable, domain-specific skill modules. It also argues for verifier-first development: define objective checks, run no-skill versus curated-skill comparisons, and retire low-signal skill packs quickly. In short, scaling the base model is only part of the story; maintaining high-quality, validated skills is increasingly central to reliable agent outcomes.
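The verifier-first loop described above can be reduced to a gating decision: keep a skill pack only if its measured pass-rate delta on objective checks clears a threshold, and retire it otherwise. This is a hedged sketch; the function name, pack names, rates, and the 2pp threshold are all assumptions, not from the paper.

```python
def should_keep_skill_pack(no_skill_rate: float,
                           with_skill_rate: float,
                           min_gain_pp: float = 2.0) -> bool:
    """Retire low-signal packs: require a minimum percentage-point gain
    on the no-skill vs with-skill comparison (threshold is an assumption)."""
    gain_pp = (with_skill_rate - no_skill_rate) * 100
    return gain_pp >= min_gain_pp

# (no-skill rate, with-skill rate) per pack -- illustrative values only
packs = {
    "narrow_healthcare_pack": (0.20, 0.72),  # large measured gain: keep
    "generic_playbook_pack": (0.60, 0.59),   # regression: retire
}

decisions = {name: should_keep_skill_pack(n, w) for name, (n, w) in packs.items()}
```

Running this comparison on every release keeps the skill library honest: a pack survives only as long as the deterministic checks say it still earns its keep.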




© 2026 Insights. All rights reserved.