Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark

Original: Claude Opus 4.6 is going exponential on METR's 50%-time-horizon benchmark, beating all predictions

Feb 22, 2026, by Insights AI (Reddit)

Claude Opus 4.6 Exceeds METR Benchmark Expectations

Anthropic's Claude Opus 4.6 has posted a striking result on the software task benchmark run by METR (Model Evaluation and Threat Research). A post about the result drew 930+ upvotes on Reddit's r/singularity.

The Numbers

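According to METR, Claude Opus 4.6's 50%-time-horizon is approximately 14.5 hours for software tasks (95% CI: 6 hours to 98 hours). The 50%-time-horizon is the task length, measured by how long the task takes a human expert, at which the model is estimated to succeed 50% of the time. It is not the time the model itself spends working.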
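To make the metric concrete, here is a minimal sketch. It assumes success probability follows a logistic curve in the logarithm of human task length, in the spirit of METR's published methodology. The function names and the slope value are illustrative assumptions, not METR's code or fitted parameters; only the 14.5-hour point estimate comes from the figures above.

```python
import math

HORIZON_HOURS = 14.5  # point estimate quoted in the article


def success_probability(task_hours, horizon_hours=HORIZON_HOURS, slope=1.0):
    """Modeled chance of success on a task that takes a human `task_hours`.

    Assumed form: logistic in log2(task length), equal to 0.5 at the horizon.
    `slope` is a made-up illustrative value, not a fitted parameter.
    """
    log_gap = math.log2(horizon_hours) - math.log2(task_hours)
    return 1.0 / (1.0 + math.exp(-slope * log_gap))


def horizon_from_fit(intercept, slope):
    """Solve sigmoid(intercept - slope * log2(t)) = 0.5 for t.

    The sigmoid equals 0.5 when its argument is zero, so t = 2 ** (intercept / slope).
    """
    return 2.0 ** (intercept / slope)


# Tasks shorter than the horizon come out above 50%, longer ones below.
for hours in (1.0, 14.5, 98.0):
    print(f"{hours:5.1f} h task -> p(success) = {success_probability(hours):.2f}")
```

Under this assumed model, a task exactly as long as the horizon is a coin flip, which is all the headline number claims. It says nothing by itself about reliability on shorter tasks; that depends on the slope.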

"We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours on software tasks. While this is the highest point estimate we've reported, this measurement is extremely noisy because our current task suite is nearly saturated."

Exponential Growth Trajectory

Community analysis suggests the doubling time of the 50%-time-horizon is now below 3 months. Charted against previous models, the trend shows the length of tasks that AI can complete expanding rapidly: from minutes to hours to potentially days.
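The doubling-time claim rests on simple arithmetic, sketched below under the assumption of clean exponential growth between two measurements. The function names and the example inputs are hypothetical and chosen only to illustrate the calculation; they are not METR data points.

```python
import math


def doubling_time(horizon_then, horizon_now, elapsed_months):
    """Months per doubling implied by two horizon measurements.

    Assumes exponential growth: horizon_now = horizon_then * 2 ** (elapsed / T).
    """
    return elapsed_months * math.log(2) / math.log(horizon_now / horizon_then)


def growth_factor(elapsed_months, doubling_months):
    """How many times larger the horizon becomes after `elapsed_months`."""
    return 2.0 ** (elapsed_months / doubling_months)


# Hypothetical inputs: a horizon going from 2 h to 16 h is three doublings,
# so over nine months that implies roughly three months per doubling.
print(round(doubling_time(2.0, 16.0, 9.0), 2))

# A 3-month doubling time compounds to 2 ** 4 = 16x over twelve months.
print(growth_factor(12.0, 3.0))
```

The second call shows why a sub-3-month doubling time attracts attention: it compounds to more than a 16-fold increase per year. That is arithmetic on the stated rate, not a forecast, and with a confidence interval as wide as 6 to 98 hours, the implied doubling time is correspondingly uncertain.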

Limitations and Context

METR flagged that its current task suite is nearly saturated, which makes the measurement extremely noisy and highlights the need for harder evaluation tasks. Even with this caveat, the result is meaningful evidence that AI agent capabilities are growing quickly, although the wide confidence interval means the exact pace remains uncertain. Saturation is itself a signal: when a model clears most of the tasks in a benchmark, the goalposts need to move.


© 2026 Insights. All rights reserved.