METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN read it as a warning that a benchmark win says little about scope control, code quality, or repo fit.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching the 40% figure the post cites for Claude Opus 4.6.
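The post does not share its harness, so the following is only a minimal sketch of the idea, assuming a git-managed repo with a pytest suite; every function name here is hypothetical rather than taken from the post. The core move is cheap: run the test suite after each individual edit and roll back any edit that regresses it, so errors never compound across steps.

```python
import subprocess

def run_tests(repo_dir: str) -> bool:
    """Run the repo's test suite; True if it passes.
    Assumes a pytest project -- swap in the repo's real test command."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],
        cwd=repo_dir, capture_output=True, timeout=600,
    )
    return result.returncode == 0

def apply_edit(repo_dir: str, patch: str) -> None:
    """Apply one model-proposed edit, here expressed as a unified diff."""
    subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                   input=patch.encode(), check=True)

def revert(repo_dir: str) -> None:
    """Throw away the last edit, including any files it created."""
    subprocess.run(["git", "reset", "--hard"], cwd=repo_dir, check=True)
    subprocess.run(["git", "clean", "-fd"], cwd=repo_dir, check=True)

def verified_edit_loop(repo_dir: str, proposed_patches: list[str]) -> int:
    """Apply edits one at a time, verifying after each one."""
    kept = 0
    for patch in proposed_patches:
        apply_edit(repo_dir, patch)
        if run_tests(repo_dir):
            subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
            subprocess.run(["git", "commit", "-m", "agent edit"],
                           cwd=repo_dir, check=True)
            kept += 1
        else:
            revert(repo_dir)  # reject the regression and keep going
    return kept
```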
A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing flawed test cases in at least 16.4% of tasks. The announcement recasts how coding-model benchmark scores should be weighed in production decision-making.
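One way to see why that fraction matters: if a share f of tasks have broken tests, a reported score only brackets the true pass rate on the clean subset. The bounding arithmetic below is standard back-of-envelope reasoning, not anything from OpenAI's statement, and the 65% headline score is a made-up example.

```python
def clean_subset_bounds(reported: float, flawed: float) -> tuple[float, float]:
    """Bounds on the pass rate over the clean (1 - flawed) share of tasks.
    reported = S and flawed = f in: S = p_clean*(1-f) + p_flawed*f,
    with the unknown p_flawed allowed to range over [0, 1]."""
    clean = 1.0 - flawed
    low = max(0.0, (reported - flawed) / clean)  # every flawed task scored as a pass
    high = min(1.0, reported / clean)            # every flawed task scored as a fail
    return low, high

# A hypothetical headline score of 65% with 16.4% flawed tasks:
lo, hi = clean_subset_bounds(0.65, 0.164)
print(f"clean-subset pass rate is between {lo:.1%} and {hi:.1%}")
# -> between 58.1% and 77.8%, a ~20-point window of ambiguity
```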
A Hacker News discussion highlights arXiv:2602.11988, which finds that repository context files like AGENTS.md often reduced coding-agent task success while increasing inference cost by more than 20%.
A LocalLLaMA discussion of SWE-rebench's January runs reports a tight race at the top, with Claude Code leading on both pass@1 and pass@5 while open models narrow the gap.
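For readers comparing those two columns: pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021); whether SWE-rebench uses exactly this estimator is an assumption here, but it shows why pass@5 sits well above pass@1 even for the same runs.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): the probability that at
    least one of k samples, drawn without replacement from n attempts
    of which c passed, is a passing attempt."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 attempts on a task, 3 of them passing:
print(pass_at_k(10, 3, 1))  # 0.30  -> contribution to pass@1
print(pass_at_k(10, 3, 5))  # ~0.92 -> contribution to pass@5
```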