Claude Opus 4.6 achieved a 50% time horizon of approximately 14.5 hours on METR's software task benchmark, beating all predictions and implying a doubling time of under 3 months for AI task horizons (a rough back-of-the-envelope check is sketched below).
#benchmark
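A quick sanity check of the doubling-time claim, assuming exponential growth in the 50% time horizon. The earlier-horizon value and the elapsed time below are hypothetical placeholders chosen for illustration, not METR's published figures.

```python
from math import log2

def implied_doubling_time(h0_hours: float, h1_hours: float, months_elapsed: float) -> float:
    """Doubling time (in months) implied by two 50%-time-horizon measurements
    taken months_elapsed apart, assuming exponential growth."""
    return months_elapsed / log2(h1_hours / h0_hours)

# Hypothetical placeholders: an earlier frontier model at ~4.5 h and Opus 4.6 at
# ~14.5 h, measured roughly 5 months apart.
print(implied_doubling_time(4.5, 14.5, 5.0))  # ≈ 2.96 months, i.e. under 3 months
```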
Alibaba launched Qwen 3.5 on February 16 under Apache 2.0, featuring 397B total parameters in a sparse MoE architecture (17B active per token), a 256K context window, and native multimodal capabilities that match leading US proprietary models on key benchmarks.
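As a rough illustration of the total-versus-active parameter split in a sparse MoE model, the sketch below uses hypothetical shared-weight and expert sizes chosen only to reproduce the 397B/17B figures; they are not Qwen 3.5's published architecture.

```python
def moe_param_counts(shared_b: float, num_experts: int, experts_per_token: int,
                     expert_size_b: float) -> tuple[float, float]:
    """Return (total, active-per-token) parameter counts in billions for a
    simplified sparse-MoE model: shared weights plus routed expert weights."""
    total = shared_b + num_experts * expert_size_b
    active = shared_b + experts_per_token * expert_size_b
    return total, active

# Hypothetical split: 9B shared weights, 97 experts of 4B each, top-2 routing.
total_b, active_b = moe_param_counts(9.0, 97, 2, 4.0)
print(f"total ≈ {total_b:.0f}B, active per token ≈ {active_b:.0f}B")  # 397B / 17B
```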
Anthropic launched Claude Sonnet 4.6 on February 17, offering major upgrades in coding, computer use, and agent planning. It is now the default model for Free and Pro users, at unchanged $3/$15 per-million-token pricing.
OpenAI published five model-generated submissions to the First Proof math challenge. None were accepted as valid solutions, but the release gives researchers direct evidence of where frontier reasoning systems succeed and fail.
A Hacker News post highlighted the SkillsBench paper, which evaluates agent skills across 86 tasks and 11 domains. Curated skills improved the average pass rate substantially, while self-generated skills showed no gain on average.
OpenAI reports that, across more than one million ChatGPT conversations, the share of difficult interactions on which the model exceeded a human baseline increased roughly fourfold between September 2024 and January 2026. The company also reports large gains on case-interview and puzzle-style open-ended tasks.
A LocalLLaMA discussion of the January SWE-rebench runs reports tightly clustered results at the top, with Claude Code leading on pass@1 and pass@5 while open models narrow the gap.
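For reference, pass@k is usually estimated with the unbiased estimator from Chen et al. (2021); SWE-rebench's exact reporting may differ, so treat this as a generic sketch of the metric.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts per task, 3 of which pass.
print(pass_at_k(10, 3, 1))  # 0.30
print(pass_at_k(10, 3, 5))  # ≈ 0.92
```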
Anthropic released Claude Opus 4.6, achieving industry-leading performance in coding, long-context retrieval, and knowledge work.