Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark
Original: Claude Opus 4.6 is going exponential on METR's 50%-time-horizon benchmark, beating all predictions
Claude Opus 4.6 Exceeds METR Benchmark Expectations
Anthropic's Claude Opus 4.6 has posted a striking result on the METR (Model Evaluation and Threat Research) software task benchmark, drawing 930+ upvotes on Reddit's r/singularity.
The Numbers
According to METR, Claude Opus 4.6's 50%-time-horizon, defined as the length of tasks (measured by how long they take human professionals) that the model completes successfully 50% of the time, is approximately 14.5 hours for software tasks (95% CI: 6 hours to 98 hours).
"We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours on software tasks. While this is the highest point estimate we've reported, this measurement is extremely noisy because our current task suite is nearly saturated."
Exponential Growth Trajectory
Community analysis suggests the doubling time for AI task capability is now below 3 months. Charted against previous models, the trend shows the time horizon for AI-completable tasks expanding rapidly: from minutes to hours, and potentially to days.
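To make the extrapolation concrete: a fixed doubling time T means the horizon grows as h(t) = h0 · 2^(t/T), with t in months. A quick sanity check, taking the 14.5-hour point estimate and a 3-month doubling time from the post as given (the wide 95% CI above means these projections are illustrative, not predictive):

```python
# Exponential extrapolation: horizon(t) = h0 * 2**(t / doubling_months).
h0_hours = 14.5        # current 50%-time-horizon (METR point estimate)
doubling_months = 3.0  # community-estimated doubling time

for months in (3, 6, 12, 24):
    horizon = h0_hours * 2 ** (months / doubling_months)
    print(f"{months:>2} months out: ~{horizon:,.0f} hours (~{horizon / 24:.1f} days)")
```

Under these assumptions the horizon would reach roughly 10 days of work within a year, which is why the saturation of the current task suite matters so much for measurement.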
Limitations and Context
METR flagged that the current benchmark suite is nearly saturated, which adds noise to measurements. Despite this caveat, the result is meaningful evidence that AI agent capabilities are growing at an accelerating pace, and the saturation itself signals that harder evaluation tasks are needed.
Related Articles
Anthropic released Claude Opus 4.6, achieving industry-leading performance in coding, long-context retrieval, and knowledge work.
Anthropic announced on January 28, 2026 that ServiceNow selected Claude as its default model for AI agent development. ServiceNow cited up to 95% productivity gains in some workflows and reported large-scale AI request volumes.
Anthropic said on X that Claude Opus 4.6 showed cases of benchmark recognition during BrowseComp evaluation. The engineering write-up turns that into a broader warning about eval integrity in web-enabled model testing.