Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark
Original: Claude Opus 4.6 is going exponential on METR's 50%-time-horizon benchmark, beating all predictions
Claude Opus 4.6 Exceeds METR Benchmark Expectations
Anthropic's Claude Opus 4.6 has posted a striking result on the software task benchmark run by METR (Model Evaluation and Threat Research), and the news drew 930+ upvotes on Reddit's r/singularity.
The Numbers
According to METR, Claude Opus 4.6's 50%-time-horizon is approximately 14.5 hours for software tasks (95% CI: 6 hours to 98 hours). The 50%-time-horizon is the length of task, measured by how long it takes a human expert, that the model is estimated to complete successfully 50% of the time.
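To get a feel for how wide that confidence interval is, it can help to express it in doublings rather than hours. The minimal sketch below uses only the three figures METR reports (6, 14.5, and 98 hours); the helper name `doublings_between` is ours, chosen for illustration.

```python
import math

# Figures reported by METR for Claude Opus 4.6, in hours.
CI_LOW_HOURS = 6.0
POINT_ESTIMATE_HOURS = 14.5
CI_HIGH_HOURS = 98.0


def doublings_between(start_hours: float, end_hours: float) -> float:
    """Number of doublings needed to go from start_hours to end_hours."""
    if start_hours <= 0 or end_hours <= 0:
        raise ValueError("time horizons must be positive")
    return math.log2(end_hours / start_hours)


ci_ratio = CI_HIGH_HOURS / CI_LOW_HOURS
ci_width = doublings_between(CI_LOW_HOURS, CI_HIGH_HOURS)
below = doublings_between(CI_LOW_HOURS, POINT_ESTIMATE_HOURS)
above = doublings_between(POINT_ESTIMATE_HOURS, CI_HIGH_HOURS)

print(f"Upper bound is {ci_ratio:.1f}x the lower bound")
print(f"Interval spans about {ci_width:.1f} doublings")
print(f"{below:.1f} doublings below the point estimate, {above:.1f} above")
```

The upper bound is roughly 16 times the lower bound, so the interval covers about four doublings of the time horizon, with more room above the point estimate than below it.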
"We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours on software tasks. While this is the highest point estimate we've reported, this measurement is extremely noisy because our current task suite is nearly saturated."
Exponential Growth Trajectory
Community analysis suggests that the doubling time of the 50%-time-horizon has now fallen below 3 months. Charted against previous models, the trend shows the length of tasks that AI can complete expanding rapidly, from minutes to hours and potentially to days.
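As a rough illustration of what such a doubling time would imply, the sketch below extrapolates the 14.5-hour point estimate under a constant doubling time. The 3-month value is treated as an upper bound taken from the community analysis. The 40-hour and 168-hour targets and the function names are our own illustrative assumptions, not METR figures, and the extrapolation ignores the wide confidence interval around the starting point.

```python
import math

CURRENT_HORIZON_HOURS = 14.5  # METR point estimate for Claude Opus 4.6
DOUBLING_TIME_MONTHS = 3.0    # assumption: upper bound from community analysis


def projected_horizon(months_ahead: float,
                      start_hours: float = CURRENT_HORIZON_HOURS,
                      doubling_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Time horizon after months_ahead, assuming steady exponential growth."""
    return start_hours * 2 ** (months_ahead / doubling_months)


def months_until(target_hours: float,
                 start_hours: float = CURRENT_HORIZON_HOURS,
                 doubling_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Months needed for the horizon to grow from start_hours to target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)


# Hypothetical targets: a 40-hour work week and a 168-hour calendar week.
for label, target in [("40-hour work week", 40.0), ("168-hour calendar week", 168.0)]:
    print(f"{label}: about {months_until(target):.1f} months at this doubling time")
```

Under these assumptions a 40-hour horizon is a little over four months away, but a doubling time estimated from a noisy, nearly saturated benchmark should be read as a scenario rather than a forecast.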
Limitations and Context
METR flagged that the current benchmark suite is nearly saturated, which adds noise to the measurement and highlights the need for harder evaluation tasks. Even with this caveat, the result is meaningful evidence that AI agent capabilities are growing at an accelerating pace, although the wide confidence interval leaves the exact rate uncertain. Saturation is itself a signal: once a model completes most of a benchmark's tasks, the goalposts need to move.