Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark
Original: Claude Opus 4.6 is going exponential on METR's 50%-time-horizon benchmark, beating all predictions
Claude Opus 4.6 Exceeds METR Benchmark Expectations
Anthropic's Claude Opus 4.6 has posted a striking result on the METR (Model Evaluation and Threat Research) software task benchmark, drawing 930+ upvotes on Reddit's r/singularity.
The Numbers
According to METR, Claude Opus 4.6's 50%-time-horizon, defined as the length of tasks (measured by how long they take human professionals) that the model completes successfully 50% of the time, is approximately 14.5 hours for software tasks (95% CI: 6 hours to 98 hours).
"We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours on software tasks. While this is the highest point estimate we've reported, this measurement is extremely noisy because our current task suite is nearly saturated."
Exponential Growth Trajectory
Community analysis suggests the doubling time for AI task capability is now below 3 months. Charted against previous models, the trend shows the time horizon for AI-completable tasks expanding rapidly: from minutes to hours, and potentially to days.
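To make the extrapolation concrete: a fixed doubling time T means the horizon grows as h(t) = h0 · 2^(t/T), with t in months. A quick sanity check, taking the 14.5-hour point estimate and a 3-month doubling time from the post as given (the wide 95% CI above means these projections are illustrative, not predictive):

```python
# Exponential extrapolation: horizon(t) = h0 * 2**(t / doubling_months).
h0_hours = 14.5        # current 50%-time-horizon (METR point estimate)
doubling_months = 3.0  # community-estimated doubling time

for months in (3, 6, 12, 24):
    horizon = h0_hours * 2 ** (months / doubling_months)
    print(f"{months:>2} months out: ~{horizon:,.0f} hours (~{horizon / 24:.1f} days)")
```

Under these assumptions the horizon would reach roughly 10 days of work within a year, which is why the saturation of the current task suite matters so much for measurement.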
Limitations and Context
METR flagged that the current benchmark suite is nearly saturated, which adds noise to measurements. Despite this caveat, the result is meaningful evidence that AI agent capabilities are growing at an accelerating pace, and the saturation itself signals that harder evaluation tasks are needed.
Related Articles
Anthropic released Claude Opus 4.6, achieving industry-leading performance in coding, long-context retrieval, and knowledge work.
Anthropic announced on January 28, 2026 that ServiceNow selected Claude as its default model for AI agent development. ServiceNow cited up to 95% productivity gains in some workflows and reported large-scale AI request volumes.
Anthropic said on X that Claude Opus 4.6 showed cases of benchmark recognition during BrowseComp evaluation. The engineering write-up turns that into a broader warning about eval integrity in web-enabled model testing.