METR follow-up: from “20% slowdown” to possible AI speedup for expert developers
Original: METR follows up on often cited study from last year on 20% developer slowdown in specific experiment, finds speedup now likely, but other interesting findings as well
An r/singularity thread is circulating METR’s latest note, an uplift update on how AI tools affect software productivity. The topic matters because METR’s earlier study became a common reference for the claim that AI assistance could slow down experienced developers in realistic open-source work.
What METR now reports
METR reiterates the original finding: in data from February to June 2025, experienced open-source developers were around 20% slower on study tasks when AI tools were allowed. In the follow-up, the team says that late-2025/early-2026 conditions look different. Broader adoption of agentic tools such as Claude Code and Codex appears to have changed developer behavior and task selection.
The update includes raw estimates that point toward speedup. For returning developers, METR reports an estimated -18% effect (confidence interval -38% to +9%); for newly recruited developers, an estimated -4% effect (confidence interval -15% to +9%). In METR’s framing, negative values indicate speedup. Both intervals include zero, so neither estimate rules out no effect at all. The shift is directionally important, but METR repeatedly warns that the signal is noisy.
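To make the sign convention concrete, here is a minimal Python sketch that encodes the two reported estimates and checks whether each interval excludes zero. The `EffectEstimate` class and its method names are illustrative assumptions, not anything METR publishes; only the numbers come from the update as summarized above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectEstimate:
    """Hypothetical container for one reported effect (illustrative only)."""
    label: str
    point: float    # percent change in task time; negative means speedup
    ci_low: float   # lower bound of the reported confidence interval
    ci_high: float  # upper bound of the reported confidence interval

    def direction(self) -> str:
        if self.point < 0:
            return "speedup"
        if self.point > 0:
            return "slowdown"
        return "no change"

    def excludes_zero(self) -> bool:
        # True only when the whole interval sits on one side of zero.
        return self.ci_low > 0 or self.ci_high < 0


# Figures as quoted in the article above.
ESTIMATES = [
    EffectEstimate("returning developers", -18.0, -38.0, 9.0),
    EffectEstimate("newly recruited developers", -4.0, -15.0, 9.0),
]


def summarize(e: EffectEstimate) -> str:
    verdict = "excludes zero" if e.excludes_zero() else "includes zero"
    return f"{e.label}: {e.point:+.0f}% ({e.direction()}), interval {verdict}"


if __name__ == "__main__":
    for estimate in ESTIMATES:
        print(summarize(estimate))
```

Both point estimates read as speedup, yet neither interval excludes zero, which is why the update is best treated as directional and not as a settled number.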
Why uncertainty remains high
- Selection effects increased: 30% to 50% of surveyed developers reported skipping tasks they did not want to do without AI.
- Recruitment incentives changed: pay dropped from $150/hour in the original study to $50/hour in the follow-up.
- Measurement became harder: multi-agent workflows made time tracking less reliable.
METR’s interpretation is nuanced: productivity uplift is now likely higher than in early 2025, but current experimental design may systematically miss the tasks and developers with the biggest AI gains. So this is not a clean “AI is definitely X% faster” conclusion. It is a methodological warning plus a directional update.
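The selection concern can be illustrated with a toy simulation. The sketch below is not METR's estimation method: it assumes a made-up uniform distribution of per-task effects and assumes developers withhold exactly the tasks where AI helps most. The function name `simulate_selection` and all parameters are hypothetical.

```python
import random
from statistics import mean


def simulate_selection(n_tasks: int = 1000,
                       skip_fraction: float = 0.4,
                       seed: int = 0) -> tuple[float, float]:
    """Return (true_mean_effect, observed_mean_effect) in percent.

    Negative values mean AI makes the task faster, matching the
    sign convention used in the article.
    """
    rng = random.Random(seed)
    # Assumption: per-task effects are spread between a 60% speedup
    # and a 10% slowdown. These bounds are invented for illustration.
    effects = [rng.uniform(-60.0, 10.0) for _ in range(n_tasks)]
    true_mean = mean(effects)

    # Assumption: developers skip the tasks with the largest AI gains,
    # so those tasks never enter the experiment.
    n_skipped = int(n_tasks * skip_fraction)
    submitted = sorted(effects)[n_skipped:]  # drop the most negative effects
    observed_mean = mean(submitted)
    return true_mean, observed_mean


if __name__ == "__main__":
    true_mean, observed_mean = simulate_selection()
    print(f"true mean effect:     {true_mean:+.1f}%")
    print(f"observed mean effect: {observed_mean:+.1f}%")
```

Under these assumptions the observed mean is always less negative than the true mean, so the experiment understates the speedup. Real skipping behavior is unlikely to be this clean, but the direction of the bias is the point.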
For engineering teams, the practical takeaway is to instrument productivity locally instead of importing a single external headline number. Segment by task type, tool stack, review quality, and completion criteria. The Reddit discussion is useful because it shows that measurement design now matters as much as model capability when evaluating AI coding impact.
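As a rough sketch of what local, segmented measurement might look like, the Python below groups task records by type and compares median completion time with and without AI. The record fields (`task_type`, `ai_allowed`, `minutes`) and the sample data are invented for illustration; a real setup would also need the other segments named above and far more data per bucket.

```python
from collections import defaultdict
from statistics import median

# Hypothetical task log; every value here is made up.
RECORDS = [
    {"task_type": "bugfix", "ai_allowed": True, "minutes": 40},
    {"task_type": "bugfix", "ai_allowed": True, "minutes": 50},
    {"task_type": "bugfix", "ai_allowed": False, "minutes": 60},
    {"task_type": "bugfix", "ai_allowed": False, "minutes": 60},
    {"task_type": "refactor", "ai_allowed": True, "minutes": 90},
    {"task_type": "refactor", "ai_allowed": True, "minutes": 110},
    {"task_type": "refactor", "ai_allowed": False, "minutes": 80},
    {"task_type": "refactor", "ai_allowed": False, "minutes": 80},
    {"task_type": "docs", "ai_allowed": True, "minutes": 20},
]


def segment_effects(records: list[dict]) -> dict[str, float]:
    """Percent change in median task time per task type.

    Negative values mean tasks were faster when AI was allowed.
    Task types missing either arm are skipped, since no comparison
    is possible for them.
    """
    buckets: dict[str, dict[bool, list[float]]] = defaultdict(
        lambda: {True: [], False: []}
    )
    for record in records:
        buckets[record["task_type"]][record["ai_allowed"]].append(record["minutes"])

    effects: dict[str, float] = {}
    for task_type, arms in buckets.items():
        if not arms[True] or not arms[False]:
            continue
        ratio = median(arms[True]) / median(arms[False])
        effects[task_type] = round((ratio - 1.0) * 100.0, 1)
    return effects


if __name__ == "__main__":
    for task_type, effect in segment_effects(RECORDS).items():
        print(f"{task_type}: {effect:+.1f}%")
```

In this invented data the pooled picture would hide that bugfix tasks got faster while refactor tasks got slower. The `docs` bucket is dropped because it has no AI-disallowed tasks, a small-scale version of the selection problem described earlier.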