HN thinks the SWE-bench story is about contamination, not bragging rights
Original: SWE-bench Verified no longer measures frontier coding capabilities
On Hacker News, this landed less like a fresh leaderboard update and more like an admission that a flagship benchmark has stopped being trustworthy. In its analysis, OpenAI argues that SWE-bench Verified no longer measures frontier coding capability in a useful way. The two big reasons are simple and damaging: too many tests reject valid fixes, and too many models appear to have seen the problems or solutions during training. HN immediately centered the discussion there instead of treating this as another score-war post.
The concrete numbers are what made the argument hard to ignore. OpenAI audited 138 problems that o3 did not consistently solve and says 59.4% of them had material issues in test design or problem specification. Of the audited tasks, 35.5% had overly narrow tests that rejected functionally correct implementations, while 18.8% had overly broad tests that required extra behavior not stated in the prompt. On top of that, OpenAI says frontier models could often reproduce the original gold patch or verbatim details from the problem statement, which is a strong sign of contamination. At that point, a higher score starts to measure exposure and optimization pressure, not clean software engineering ability.
HN commenters quickly widened the lens from OpenAI to the benchmark ecosystem itself. A SWE-bench co-creator showed up in the thread and noted that Verified is now saturated at 93.9%, while the newer multilingual and multimodal variants still have room to grow. Other commenters took a harsher line: any benchmark that becomes influential also becomes training data, marketing material, and an optimization target. That matched a long-running complaint on HN that many supposedly benchmark-winning pull requests still look nothing like code a human reviewer would merge into a real project.
The practical takeaway was not that evals are useless. It was that one famous benchmark cannot keep carrying frontier coding claims once the whole industry has trained on it, tuned around it, and learned how to game its blind spots. OpenAI now recommends SWE-bench Pro instead of Verified. HN seemed ready to move on, but without much faith that the next standard will stay clean for long. The thread's real subject was not leaderboard drama. It was how fast a benchmark turns into theater once everyone knows it matters.
Related Articles
HN piled in because this was bigger than another benchmark refresh. OpenAI said SWE-bench Verified is no longer a trustworthy frontier coding signal, and the thread immediately shifted to contamination, saturated leaderboards, and whether public coding evals can stay clean at all.
Hacker News treated Anthropic’s Claude Code write-up as a rare admission that product defaults and prompt-layer tweaks can make a model feel worse even when the API layer stays unchanged. By crawl time on April 24, 2026, the thread had 727 points and 543 comments.
LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing contamination and flawed tests laid out in numbers big enough that the old bragging rights no longer looked stable.