#benchmark

LLM Reddit Feb 22, 2026 1 min read

Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark

Claude Opus 4.6 achieved a 50% time horizon of approximately 14.5 hours on METR's software task benchmark, beating all predictions and suggesting a doubling time of under 3 months for AI task capabilities.

#claude #anthropic #metr

LLM Feb 22, 2026 1 min read

Alibaba Releases Qwen 3.5 Open-Source Model Claiming Frontier-Level Performance

Alibaba launched Qwen 3.5 on February 16 under Apache 2.0. The model has 397B parameters in a sparse MoE architecture (17B active), a 256K context window, and native multimodal capabilities, and Alibaba claims it matches leading US proprietary models on key benchmarks.

#alibaba #qwen #open-source

LLM Feb 22, 2026 1 min read

Anthropic Releases Claude Sonnet 4.6 as New Default Model With 1M Token Context

Anthropic launched Claude Sonnet 4.6 on February 17 with major upgrades in coding, computer use, and agent planning. It is now the default model for Free and Pro users, at the same $3/$15 per million tokens pricing.

#anthropic #claude #product-launch

LLM Feb 20, 2026 2 min read

OpenAI Publishes First Proof Model Submissions

OpenAI published five model-generated submissions to the First Proof math challenge. None were accepted as valid solutions, but the release gives researchers direct evidence of where frontier reasoning systems succeed and fail.

#openai #reasoning #math

LLM Hacker News Feb 17, 2026 1 min read

SkillsBench Finds Self-Generated Agent Skills Add No Average Benefit

A Hacker News post highlighted the SkillsBench paper, which evaluates agent skills across 86 tasks and 11 domains. Curated skills improved average pass rate substantially, while self-generated skills showed no average gain.

#llm-agents #benchmark #evaluation

LLM Feb 16, 2026 2 min read

OpenAI: High-Difficulty ChatGPT Reasoning Interactions Rose 4x in 16 Months

OpenAI reports that, across more than one million ChatGPT conversations, the share of difficult interactions exceeding a human baseline increased roughly fourfold from September 2024 to January 2026. The company also shows large gains in case-interview and puzzle-style open tasks.

#openai #chatgpt #reasoning

LLM Reddit Feb 14, 2026 1 min read

SWE-rebench January 2026 Snapshot Highlights a Tight Race in Coding Agents

A LocalLLaMA discussion of SWE-rebench January runs reports close top-tier results, with Claude Code leading pass@1 and pass@5 while open models narrow the gap.

#benchmark #coding-agents #swe-bench

LLM Feb 13, 2026 1 min read

Anthropic Launches Claude Opus 4.6, Outperforms GPT-5.2

Anthropic released Claude Opus 4.6, achieving industry-leading performance in coding, long-context retrieval, and knowledge work.

#anthropic #claude #llm