A Hacker News post highlighted the SkillsBench paper, which evaluates agent skills across 86 tasks and 11 domains. Curated skills improved average pass rate substantially, while self-generated skills showed no average gain.
#benchmark
RSS FeedLLM Hacker News Feb 17, 2026 1 min read
LLM Feb 16, 2026 2 min read
OpenAI reports that, across more than one million ChatGPT conversations, the share of difficult interactions exceeding a human baseline increased roughly fourfold from September 2024 to January 2026. The company also shows large gains in case-interview and puzzle-style open tasks.
LLM Reddit Feb 14, 2026 1 min read
A LocalLLaMA discussion of SWE-rebench January runs reports close top-tier results, with Claude Code leading pass@1 and pass@5 while open models narrow the gap.
LLM Feb 13, 2026 1 min read
Anthropic released Claude Opus 4.6, achieving industry-leading performance in coding, long-context retrieval, and knowledge work.