HN Turns on SWE-bench Verified as Contamination Overtakes the Score
Original: SWE-bench Verified no longer measures frontier coding capabilities
The energy around this Hacker News thread was not “new leaderboard, who won?” It was the opposite. In its write-up, OpenAI said SWE-bench Verified no longer meaningfully measures frontier coding ability, and the HN discussion (item 47910388) treated that as confirmation of something many agent builders already suspected: once a public coding benchmark becomes important enough, contamination and over-optimization start to matter as much as the model itself.
OpenAI’s case rests on two problems. First, it audited 138 Verified tasks that its models often failed and found material issues in 59.4% of them. Some tests were too narrow, rejecting solutions that were functionally correct but implemented differently. Others were too wide, checking for behavior that the prompt never actually asked for. Second, the company found evidence that frontier models could reproduce gold patches or highly specific problem details, suggesting benchmark exposure during training.
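To make the "too narrow" failure mode concrete, here is a hypothetical sketch; the function and tests are invented, not taken from a real SWE-bench task. The imagined issue only asks that an out-of-range port be rejected, but the harness test also pins the exact error message, so a functionally correct patch with different wording fails while a behavior-level test would accept it.

```python
# Hypothetical sketch of an over-narrow harness test (all names invented).
# The issue asks only that parse_port() reject out-of-range values.

import pytest

def parse_port(value: str) -> int:
    # A functionally correct fix whose wording differs from the gold patch's.
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

def test_rejects_bad_port_too_narrow():
    # Fails the correct patch above because it pins the exact message text,
    # an implementation detail the issue never specified.
    with pytest.raises(ValueError, match="Invalid port number"):
        parse_port("70000")

def test_rejects_bad_port_behavioral():
    # Checks only what the issue asked for: out-of-range ports raise ValueError.
    with pytest.raises(ValueError):
        parse_port("70000")
```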
That combination changes how the scores should be read. A higher number may reflect coding skill, but it may also reflect how much a model has already absorbed from the benchmark’s public footprint. HN commenters pushed the point further. One of the benchmark’s co-creators replied that SWE-bench Verified is already saturated at 93.9% and pointed readers toward upcoming multilingual and multimodal successors. Others argued that any public benchmark with enough prestige will eventually be gamed, intentionally or not, because the marketing incentives are too strong.
- OpenAI now recommends reporting SWE-bench Pro instead of SWE-bench Verified.
- The audited slice found flawed test design or underspecified tasks in 59.4% of reviewed failures.
- OpenAI also highlighted contamination risk from training on public repositories that contain the original issues and fixes; a rough memorization probe along those lines is sketched after this list.
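The sketch below is one generic way to probe for that kind of memorization, not OpenAI's actual methodology: show the model only the public issue text and measure how close its proposed diff comes to the benchmark's gold patch. `query_model()` is a placeholder for whatever completion API is under test.

```python
# Rough memorization probe (illustrative only, not OpenAI's method).
from difflib import SequenceMatcher

def query_model(prompt: str) -> str:
    # Placeholder: plug in the model under test here.
    raise NotImplementedError

def memorization_score(issue_text: str, gold_patch: str) -> float:
    prompt = (
        "Write a unified diff that fixes the following GitHub issue.\n\n"
        + issue_text
    )
    candidate = query_model(prompt)
    # A ratio near 1.0 means the model reproduced the gold patch nearly
    # verbatim, which is hard to explain without exposure during training.
    return SequenceMatcher(None, candidate, gold_patch).ratio()
```

Run across many tasks, a distribution of scores clustered near 1.0 is the kind of signal OpenAI describes as evidence of benchmark exposure.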
The thread matters because it marks a shift in the coding-agent conversation. Benchmark talk is moving away from raw pass rates and toward benchmark health itself: contamination, portability, and whether "correct" means "merged in real life" or merely "passes this harness." HN upvoted this because it was not another claim about model progress. It was an admission that one of the field's most visible yardsticks no longer gives a clean reading.
Related Articles
Hacker News treated Anthropic’s Claude Code write-up as a rare admission that product defaults and prompt-layer tweaks can make a model feel worse even when the API layer stays unchanged. By crawl time on April 24, 2026, the thread had 727 points and 543 comments.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.
r/LocalLLaMA pushed this post up because the “trust me bro” report had real operating conditions: 8-bit quantization, 64k context, OpenCode, and Android debugging.