METR follow-up: from “20% slowdown” to possible AI speedup for expert developers
Original: METR follows up on its often-cited study from last year, which found a 20% developer slowdown in a specific experiment, and now finds a speedup likely, along with other interesting findings.
An r/singularity thread is circulating METR's latest note, an uplift update on how AI tools affect software productivity. The topic matters because METR's earlier study became a common reference for the claim that AI assistance could slow down experienced developers in realistic open-source work.
What METR now reports
METR reiterates the original finding: in data from February to June 2025, experienced open-source developers were around 20% slower on study tasks when AI tools were allowed. In the follow-up, the team says that late-2025/early-2026 conditions look different. Broader adoption of agentic tools such as Claude Code and Codex appears to have changed developer behavior and task selection.
The update includes raw estimates that move toward speedup: for returning developers, METR reports an estimated -18% effect (confidence interval -38% to +9%); for newly recruited developers, an estimated -4% effect (confidence interval -15% to +9%). In their framing, negative values indicate speedup. That shift is directionally important, but METR repeatedly warns that this signal is noisy.
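To make the sign convention concrete, here is a minimal sketch that converts the reported percentages into implied task times, assuming each percentage is an estimated change in completion time relative to working without AI (negative means faster); the 100-minute baseline is purely illustrative and not from METR's data.

```python
# Illustrative only: turn METR's reported percentage effects into implied
# task times, assuming each percentage is a change in completion time
# versus the no-AI condition (negative = faster). Baseline is made up.

def implied_minutes(baseline_minutes: float, effect_pct: float) -> float:
    """Return the implied completion time under a given percentage effect."""
    return baseline_minutes * (1 + effect_pct / 100)

baseline = 100.0  # hypothetical no-AI task time in minutes

estimates = {
    "returning developers (point)":   -18,
    "returning developers (CI low)":  -38,
    "returning developers (CI high)":  +9,
    "new developers (point)":          -4,
    "new developers (CI low)":        -15,
    "new developers (CI high)":        +9,
}

for label, effect in estimates.items():
    print(f"{label:>32}: {implied_minutes(baseline, effect):6.1f} min ({effect:+d}%)")
```

Under the point estimate for returning developers, a task forecast at 100 minutes without AI would take roughly 82 minutes with it, while the upper confidence bound still allows for a modest slowdown.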
Why uncertainty remains high
- Selection effects increased: surveyed developers reported skipping tasks they did not want to do without AI (30% to 50%).
- Recruitment incentives changed: pay dropped from $150/hour in the original study to $50/hour in the follow-up.
- Measurement became harder: multi-agent workflows made time tracking less reliable.
METR’s interpretation is nuanced: productivity uplift is now likely higher than in early 2025, but the current experimental design may systematically miss the tasks and developers with the biggest AI gains. So this is not a clean “AI is definitely X% faster” conclusion; it is a methodological warning plus a directional update.
For engineering teams, the practical takeaway is to instrument productivity locally instead of importing a single external headline number. Segment by task type, tool stack, review quality, and completion criteria. The Reddit discussion is useful because it surfaces that measurement design is now as important as model capability when evaluating AI coding impact.
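As one way to start, here is a minimal sketch of local instrumentation along those dimensions; the field names, the CSV log, and the "merged after review" completion criterion are assumptions a team would replace with its own definitions, not anything METR prescribes.

```python
# Minimal sketch of local productivity instrumentation, not METR's method.
# Field names, CSV storage, and the review_passed criterion are illustrative.

import csv
import os
from dataclasses import dataclass, asdict, fields
from statistics import median

@dataclass
class TaskRecord:
    task_id: str
    task_type: str          # e.g. "bugfix", "feature", "refactor"
    tool_stack: str         # e.g. "none", "claude-code", "codex"
    minutes_spent: float
    review_passed: bool     # completion criterion: merged after review
    review_rounds: int      # rough proxy for review quality

def append_record(path: str, record: TaskRecord) -> None:
    """Append one task record to a local CSV log, writing a header if the file is new."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(TaskRecord)])
        if new_file:
            writer.writeheader()
        writer.writerow(asdict(record))

def median_minutes_by(records: list[TaskRecord], key: str) -> dict[str, float]:
    """Median time per segment (e.g. task_type or tool_stack), completed tasks only."""
    groups: dict[str, list[float]] = {}
    for r in records:
        if r.review_passed:
            groups.setdefault(str(getattr(r, key)), []).append(r.minutes_spent)
    return {k: median(v) for k, v in groups.items()}
```

A team could then compare median times for the same task type with and without agentic tooling, which is closer to the segmentation METR's update suggests than any single headline percentage.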
Related Articles
A LocalLLaMA post pointed to a new Hugging Face dataset of human-written code reviews, pairing before-and-after code changes with inline reviewer comments and negative examples across 37 languages.
Hacker News highlighted SWE-CI, an arXiv benchmark that evaluates whether LLM agents can sustain repository quality across CI-driven iterations, not just land a single passing patch.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.