Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark
Original: Claude Opus 4.6 is going exponential on METR's 50%-time-horizon benchmark, beating all predictions
Claude Opus 4.6 Exceeds METR Benchmark Expectations
Anthropic's Claude Opus 4.6 has posted a striking result on the software task benchmark run by METR (Model Evaluation and Threat Research), and the news drew more than 930 upvotes on Reddit's r/singularity.
The Numbers
According to METR, Claude Opus 4.6's 50%-time-horizon is approximately 14.5 hours for software tasks (95% CI: 6 hours to 98 hours). The 50%-time-horizon is the task length, measured by how long the task takes a skilled human, at which the model is estimated to succeed half the time.
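For readers who want to see how such a number can be derived, here is a minimal sketch. It assumes the approach METR has described publicly: fit a logistic curve of success probability against the logarithm of human task duration, then read off where the curve crosses 50%. The task durations and outcomes below are invented for illustration and are not METR data, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit P(success) = sigmoid(a + b * x) with Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient w.r.t. intercept
            g_b += (y - p) * x      # gradient w.r.t. slope
            h_aa += w               # entries of the 2x2 information matrix
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def time_horizon(a, b, p=0.5):
    """Task length (same unit as the data) where predicted success equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

# Hypothetical tasks: human completion time in minutes, and whether
# the model succeeded (1) or failed (0). Not real benchmark results.
human_minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
succeeded = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

log_lengths = [math.log2(m) for m in human_minutes]
a, b = fit_logistic(log_lengths, succeeded)
horizon_50 = time_horizon(a, b)
horizon_80 = time_horizon(a, b, p=0.8)
print(f"50% horizon: {horizon_50:.1f} min, 80% horizon: {horizon_80:.1f} min")
```

With only a handful of tasks near the crossover point, small changes in outcomes move the fitted horizon a great deal, which is one way to see why a nearly saturated suite yields a wide confidence interval.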
"We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours on software tasks. While this is the highest point estimate we've reported, this measurement is extremely noisy because our current task suite is nearly saturated."
Exponential Growth Trajectory
Community analysis suggests the doubling time of the time horizon is now below 3 months. Charted against previous models, the trend shows the length of tasks that AI can complete expanding rapidly: from minutes to hours, and potentially to days.
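The doubling-time arithmetic behind claims like this is simple enough to sketch. The example below assumes purely exponential growth between two measurements; the start and end values are hypothetical placeholders chosen to give round numbers, not figures reported by METR, and the function names are illustrative.

```python
import math

def doubling_time(h_start, h_end, elapsed_months):
    """Months per doubling, assuming exponential growth between two points."""
    return elapsed_months * math.log(2) / math.log(h_end / h_start)

def months_to_reach(current, target, doubling_months):
    """Months until the horizon reaches `target` if the trend continues."""
    return doubling_months * math.log2(target / current)

# Hypothetical: a horizon that grows from 4 hours to 16 hours in 6 months
# has doubled twice, so the doubling time is 3 months.
dt = doubling_time(4.0, 16.0, 6.0)

# Extrapolating from the reported 14.5-hour point estimate to 58 hours
# (two more doublings) under that same assumed trend.
months = months_to_reach(14.5, 58.0, dt)
print(f"doubling time: {dt:.1f} months, time to 58 h: {months:.1f} months")
```

Any such extrapolation inherits the uncertainty of both endpoints, so a doubling time computed from noisy point estimates should be read as a rough indication rather than a measurement.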
Limitations and Context
METR flagged that the current benchmark suite is nearly saturated, which adds noise to the measurement and highlights the need for harder evaluation tasks. Even with this caveat, the result is evidence that AI agent capabilities are growing quickly, though the width of the confidence interval means the point estimate should be read with caution. Saturation is itself a signal that the benchmark needs longer, more difficult tasks.
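One way to put the noise in perspective is to express the reported 95% confidence interval (6 hours to 98 hours) in units of doublings. The short calculation below does that; the 3-month doubling time used for the conversion to months is an assumption taken from the community estimate, not a METR figure.

```python
import math

def ci_width_in_doublings(lower_hours, upper_hours):
    """How many doublings separate the lower and upper CI bounds."""
    return math.log2(upper_hours / lower_hours)

width = ci_width_in_doublings(6.0, 98.0)   # reported 95% CI bounds
assumed_doubling_months = 3.0              # assumption, not a measurement
trend_months = width * assumed_doubling_months
print(f"CI spans {width:.2f} doublings, ~{trend_months:.0f} months of trend")
```

The interval spans roughly four doublings, so under the assumed trend the uncertainty in this single measurement is comparable to about a year of progress.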