HN Turns on SWE-bench Verified as Contamination Overtakes the Score
Original: SWE-bench Verified no longer measures frontier coding capabilities
The energy around this Hacker News thread was not “new leaderboard, who won?” It was the opposite. In its write-up, OpenAI said SWE-bench Verified no longer meaningfully measures frontier coding ability, and the HN discussion (item 47910388) treated that as confirmation of something many agent builders already suspected: once a public coding benchmark becomes important enough, contamination and over-optimization start to matter as much as the model itself.
OpenAI’s case rests on two problems. First, it audited 138 Verified tasks that its models often failed and found material issues in 59.4% of them. Some tests were too narrow, rejecting solutions that were functionally correct but implemented differently. Others were too wide, checking for behavior that the prompt never actually asked for. Second, the company found evidence that frontier models could reproduce gold patches or highly specific problem details, suggesting benchmark exposure during training.
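To make the "too narrow" failure mode concrete, here is a hypothetical sketch; the function and tests are invented, not taken from a real SWE-bench task. The imagined issue only asks that an out-of-range port be rejected, but the harness test also pins the exact error message, so a functionally correct patch with different wording fails while a behavior-level test would accept it.

```python
# Hypothetical sketch of an over-narrow harness test (all names invented).
# The issue asks only that parse_port() reject out-of-range values.

import pytest

def parse_port(value: str) -> int:
    # A functionally correct fix whose wording differs from the gold patch's.
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

def test_rejects_bad_port_too_narrow():
    # Fails the correct patch above because it pins the exact message text,
    # an implementation detail the issue never specified.
    with pytest.raises(ValueError, match="Invalid port number"):
        parse_port("70000")

def test_rejects_bad_port_behavioral():
    # Checks only what the issue asked for: out-of-range ports raise ValueError.
    with pytest.raises(ValueError):
        parse_port("70000")
```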
That combination changes how the scores should be read. A higher number may reflect coding skill, but it may also reflect how much a model has already absorbed from the benchmark’s public footprint. HN commenters pushed the point further. One of the benchmark’s co-creators replied that SWE-bench Verified is already saturated at 93.9% and pointed readers toward upcoming multilingual and multimodal successors. Others argued that any public benchmark with enough prestige will eventually be gamed, intentionally or not, because the marketing incentives are too strong.
- OpenAI now recommends reporting SWE-bench Pro instead of SWE-bench Verified.
- The audited slice found flawed test design or underspecified tasks in 59.4% of reviewed failures.
- OpenAI also highlighted contamination risk from training on public repositories that contain the original issues and fixes; a rough memorization probe along those lines is sketched after this list.
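The sketch below is one generic way to probe for that kind of memorization, not OpenAI's actual methodology: show the model only the public issue text and measure how close its proposed diff comes to the benchmark's gold patch. `query_model()` is a placeholder for whatever completion API is under test.

```python
# Rough memorization probe (illustrative only, not OpenAI's method).
from difflib import SequenceMatcher

def query_model(prompt: str) -> str:
    # Placeholder: plug in the model under test here.
    raise NotImplementedError

def memorization_score(issue_text: str, gold_patch: str) -> float:
    prompt = (
        "Write a unified diff that fixes the following GitHub issue.\n\n"
        + issue_text
    )
    candidate = query_model(prompt)
    # A ratio near 1.0 means the model reproduced the gold patch nearly
    # verbatim, which is hard to explain without exposure during training.
    return SequenceMatcher(None, candidate, gold_patch).ratio()
```

Run across many tasks, a distribution of scores clustered near 1.0 is the kind of signal OpenAI describes as evidence of benchmark exposure.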
The thread matters because it marks a shift in the coding-agent conversation. Benchmark talk is moving away from raw pass rates and toward benchmark health itself: contamination, portability, and whether "correct" means "merged in real life" or merely "passes this harness." HN upvoted this because it was not another claim about model progress. It was an admission that one of the field's most visible yardsticks no longer gives a clean reading.
Related Articles
Hacker News treated Anthropic’s Claude Code write-up as a rare admission that product defaults and prompt-layer tweaks can make a model feel worse even when the API layer stays unchanged. By crawl time on April 24, 2026, the thread had 727 points and 543 comments.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.
r/LocalLLaMA pushed this post up because the “trust me bro” report had real operating conditions: 8-bit quantization, 64k context, OpenCode, and Android debugging.