METR follow-up: from “20% slowdown” to possible AI speedup for expert developers
Original: METR follows up on its often-cited study from last year, which found a 20% developer slowdown in a specific experiment, and now finds a speedup likely, along with other interesting results.
An r/singularity thread is circulating METR's latest note, an uplift update, on how AI tools affect software productivity. The topic matters because METR's earlier study became a common reference for the claim that AI assistance can slow down experienced developers in realistic open-source work.
What METR now reports
METR reiterates the original finding: in data from February to June 2025, experienced open-source developers were around 20% slower on study tasks when AI tools were allowed. In the follow-up, the team says that late-2025/early-2026 conditions look different. Broader adoption of agentic tools such as Claude Code and Codex appears to have changed developer behavior and task selection.
The update includes raw estimates that move toward speedup: for returning developers, METR reports an estimated -18% effect (confidence interval -38% to +9%); for newly recruited developers, an estimated -4% effect (confidence interval -15% to +9%). In their framing, negative values indicate speedup. That shift is directionally important, but METR repeatedly warns that this signal is noisy.
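To make the sign convention concrete, here is a minimal sketch of how a percent effect on completion time can be computed. This is an illustration of the arithmetic only, not METR's actual estimator; the `percent_effect` helper and the toy timings are hypothetical.

```python
# Illustrative sketch of the sign convention (not METR's estimator):
# effect = mean(time_with_ai) / mean(time_without_ai) - 1
# Negative values indicate a speedup; positive values a slowdown.

def percent_effect(times_with_ai, times_without_ai):
    """Percent change in mean completion time when AI is allowed (hypothetical helper)."""
    mean_ai = sum(times_with_ai) / len(times_with_ai)
    mean_base = sum(times_without_ai) / len(times_without_ai)
    return (mean_ai / mean_base - 1) * 100

# Toy data: AI-assisted tasks average 82 minutes vs 100 minutes without.
print(round(percent_effect([80, 84], [98, 102]), 1))  # -> -18.0, i.e. an 18% speedup
```

Under this convention, METR's reported -18% for returning developers reads as an estimated 18% reduction in completion time, with a confidence interval wide enough to include a slowdown.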
Why uncertainty remains high
- Selection effects increased: 30% to 50% of surveyed developers reported skipping tasks they did not want to do without AI.
- Recruitment incentives changed: pay dropped from $150/hour in the original study to $50/hour in the follow-up.
- Measurement became harder: multi-agent workflows made time tracking less reliable.
METR’s interpretation is nuanced: productivity uplift is now likely higher than in early 2025, but current experimental design may systematically miss the tasks and developers with the biggest AI gains. So this is not a clean “AI is definitely X% faster” conclusion. It is a methodological warning plus a directional update.
For engineering teams, the practical takeaway is to instrument productivity locally instead of importing a single external headline number. Segment by task type, tool stack, review quality, and completion criteria. The Reddit discussion is useful because it surfaces that measurement design is now as important as model capability when evaluating AI coding impact.
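The local-instrumentation advice above can be sketched in a few lines: group task records by segment and bootstrap a confidence interval on the AI vs. no-AI time effect within each segment. This is a minimal sketch under the assumption that a team logs per-task completion times; the names (`segment_effects`, `bootstrap_effect_ci`, the toy records) are hypothetical, not part of METR's methodology.

```python
# Minimal sketch of local productivity instrumentation (all names hypothetical):
# group task records by segment, then bootstrap a 95% CI on the percent change
# in mean completion time when AI tools are used.
import random
from collections import defaultdict

def bootstrap_effect_ci(ai_times, base_times, n_boot=2000, seed=0):
    """95% bootstrap CI for percent change in mean completion time (negative = speedup)."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_boot):
        ai = [rng.choice(ai_times) for _ in ai_times]
        base = [rng.choice(base_times) for _ in base_times]
        effects.append((sum(ai) / len(ai)) / (sum(base) / len(base)) - 1)
    effects.sort()
    return effects[int(0.025 * n_boot)], effects[int(0.975 * n_boot)]

def segment_effects(records):
    """records: (segment, used_ai, minutes) tuples; returns per-segment (lo, hi) CIs."""
    by_segment = defaultdict(lambda: ([], []))
    for segment, used_ai, minutes in records:
        by_segment[segment][0 if used_ai else 1].append(minutes)
    return {seg: bootstrap_effect_ci(ai, base)
            for seg, (ai, base) in by_segment.items()
            if ai and base}  # skip segments missing either arm

# Toy usage: bug fixes vs. feature work, with and without AI.
records = [("bugfix", True, m) for m in (40, 55, 35)] + \
          [("bugfix", False, m) for m in (60, 70, 65)] + \
          [("feature", True, m) for m in (120, 140)] + \
          [("feature", False, m) for m in (110, 125)]
for seg, (lo, hi) in segment_effects(records).items():
    print(f"{seg}: {lo:+.0%} to {hi:+.0%}")
```

Even a rough harness like this makes METR's warning tangible: segments with few completed tasks, or where developers skip the no-AI arm entirely, produce intervals too wide or too biased to support a single headline number.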