METR follow-up: from “20% slowdown” to possible AI speedup for expert developers
Original: METR follows up on its often-cited study from last year, which found a 20% developer slowdown in a specific experiment, and now finds a speedup likely, along with other interesting findings.
An r/singularity thread is circulating METR's latest note, an uplift update on how AI tools affect software productivity. The topic matters because METR's earlier study became a common reference for the claim that AI assistance could slow down experienced developers in realistic open-source work.
What METR now reports
METR reiterates the original finding: in data from February to June 2025, experienced open-source developers were around 20% slower on study tasks when AI tools were allowed. In the follow-up, the team says that late-2025/early-2026 conditions look different. Broader adoption of agentic tools such as Claude Code and Codex appears to have changed developer behavior and task selection.
The update includes raw estimates that move toward speedup: for returning developers, METR reports an estimated -18% effect (confidence interval -38% to +9%); for newly recruited developers, an estimated -4% effect (confidence interval -15% to +9%). In their framing, negative values indicate speedup. That shift is directionally important, but METR repeatedly warns that this signal is noisy.
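To make the sign convention concrete, here is a minimal sketch that converts the reported percentages into implied task times, assuming each percentage is an estimated change in completion time relative to working without AI (negative means faster); the 100-minute baseline is purely illustrative and not from METR's data.

```python
# Illustrative only: turn METR's reported percentage effects into implied
# task times, assuming each percentage is a change in completion time
# versus the no-AI condition (negative = faster). Baseline is made up.

def implied_minutes(baseline_minutes: float, effect_pct: float) -> float:
    """Return the implied completion time under a given percentage effect."""
    return baseline_minutes * (1 + effect_pct / 100)

baseline = 100.0  # hypothetical no-AI task time in minutes

estimates = {
    "returning developers (point)":   -18,
    "returning developers (CI low)":  -38,
    "returning developers (CI high)":  +9,
    "new developers (point)":          -4,
    "new developers (CI low)":        -15,
    "new developers (CI high)":        +9,
}

for label, effect in estimates.items():
    print(f"{label:>32}: {implied_minutes(baseline, effect):6.1f} min ({effect:+d}%)")
```

Under the point estimate for returning developers, a task forecast at 100 minutes without AI would take roughly 82 minutes with it, while the upper confidence bound still allows for a modest slowdown.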
Why uncertainty remains high
- Selection effects increased: surveyed developers reported skipping tasks they did not want to do without AI (30% to 50%).
- Recruitment incentives changed: pay dropped from $150/hour in the original study to $50/hour in the follow-up.
- Measurement became harder: multi-agent workflows made time tracking less reliable.
METR’s interpretation is nuanced: productivity uplift is now likely higher than in early 2025, but the current experimental design may systematically miss the tasks and developers with the biggest AI gains. So this is not a clean “AI is definitely X% faster” conclusion; it is a methodological warning plus a directional update.
For engineering teams, the practical takeaway is to instrument productivity locally instead of importing a single external headline number. Segment by task type, tool stack, review quality, and completion criteria. The Reddit discussion is useful because it surfaces that measurement design is now as important as model capability when evaluating AI coding impact.
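As one way to start, here is a minimal sketch of local instrumentation along those dimensions; the field names, the CSV log, and the "merged after review" completion criterion are assumptions a team would replace with its own definitions, not anything METR prescribes.

```python
# Minimal sketch of local productivity instrumentation, not METR's method.
# Field names, CSV storage, and the review_passed criterion are illustrative.

import csv
import os
from dataclasses import dataclass, asdict, fields
from statistics import median

@dataclass
class TaskRecord:
    task_id: str
    task_type: str          # e.g. "bugfix", "feature", "refactor"
    tool_stack: str         # e.g. "none", "claude-code", "codex"
    minutes_spent: float
    review_passed: bool     # completion criterion: merged after review
    review_rounds: int      # rough proxy for review quality

def append_record(path: str, record: TaskRecord) -> None:
    """Append one task record to a local CSV log, writing a header if the file is new."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(TaskRecord)])
        if new_file:
            writer.writeheader()
        writer.writerow(asdict(record))

def median_minutes_by(records: list[TaskRecord], key: str) -> dict[str, float]:
    """Median time per segment (e.g. task_type or tool_stack), completed tasks only."""
    groups: dict[str, list[float]] = {}
    for r in records:
        if r.review_passed:
            groups.setdefault(str(getattr(r, key)), []).append(r.minutes_spent)
    return {k: median(v) for k, v in groups.items()}
```

A team could then compare median times for the same task type with and without agentic tooling, which is closer to the segmentation METR's update suggests than any single headline percentage.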
Related Articles
A LocalLLaMA post pointed to a new Hugging Face dataset of human-written code reviews, pairing before-and-after code changes with inline reviewer comments and negative examples across 37 languages.
Hacker News highlighted SWE-CI, an arXiv benchmark that evaluates whether LLM agents can sustain repository quality across CI-driven iterations, not just land a single passing patch.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.