LocalLLaMA Calls SWE-bench Verified “Benchmaxxed” as Benchmark Trust Cracks
Original: Confirmed: SWE Bench is now a benchmaxxed benchmark
The thread’s mood was not surprise but recognition
LocalLLaMA did not treat this post as a shocking revelation. The top responses read more like a grim “finally, someone said it plainly.” The Reddit title called SWE-bench Verified “benchmaxxed,” and the comments immediately linked it to Goodhart’s law: once a public benchmark becomes the target, it stops being a clean measure. That mood matters because it shows how far trust in headline coding-benchmark numbers has already eroded among the most benchmark-literate users.
The catalyst was an OpenAI analysis arguing that SWE-bench Verified no longer cleanly measures frontier coding capability and should be replaced, at least for reporting, by SWE-bench Pro.
What the OpenAI analysis said
OpenAI’s post gives two main reasons. First, the benchmark’s remaining failures are no longer obviously model failures. The company wrote that performance had risen from 74.9% to 80.9% in six months, then said an audit of a hard subset found major test-design or task-description problems in 59.4% of the 138 reviewed cases. Second, OpenAI argued contamination is now too visible to ignore. In its examples, frontier models could reproduce the gold patch or task-specific details that should not have been inferable from the prompt alone, suggesting exposure to benchmark material during training.
That combination is ugly for a public leaderboard: some failures reject correct work, while some successes may reflect dataset exposure more than real software-engineering progress.
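Both halves of that argument can be made concrete with a small sketch. The Python below is illustrative only and is not OpenAI's audit method: the function names, the 0.9 threshold, and the idea of comparing normalized patch lines are all assumptions. It scores how closely a model's output matches a gold patch (a near-verbatim match is a signal worth investigating, not proof of contamination) and converts the reported 59.4% share of 138 audited cases into a rough case count.

```python
from difflib import SequenceMatcher


def patch_similarity(model_patch: str, gold_patch: str) -> float:
    """Return a 0..1 similarity ratio between two patches.

    Blank lines and leading/trailing whitespace are ignored so that
    formatting noise does not hide an otherwise verbatim match.
    """
    def normalize(patch: str) -> list[str]:
        return [line.strip() for line in patch.splitlines() if line.strip()]

    return SequenceMatcher(None, normalize(model_patch), normalize(gold_patch)).ratio()


def looks_memorized(model_patch: str, gold_patch: str, threshold: float = 0.9) -> bool:
    """Flag outputs that nearly reproduce the gold patch (threshold is an assumption)."""
    return patch_similarity(model_patch, gold_patch) >= threshold


def flawed_case_count(audited: int, flawed_share: float) -> int:
    """Convert a reported share of audited cases into an approximate case count."""
    return round(audited * flawed_share)


if __name__ == "__main__":
    # Figures taken from the article: 59.4% of 138 reviewed cases.
    print(flawed_case_count(138, 0.594))  # 82
```

A line-level ratio is deliberately crude. It would miss a memorized patch that has been lightly reworded, and it would flag trivial one-line fixes that any competent model would write the same way, so it is best read as a triage filter for manual review.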
What the subreddit added
The most upvoted Reddit comments did not spend much time defending Verified. One called this the inevitable endpoint for any public benchmark. Another reduced the whole problem to Goodhart’s law in one line. Others said benchmarks need to stay closed or continuously refreshed if model developers are training on the same public ecosystem that generated the tasks. A few commenters pointed to SWE-rebench precisely because it rotates problems, while some noted lingering political skepticism about why vendors abandon one benchmark and adopt another. But even with that skepticism, the broader consensus was clear: static public benchmarks decay fast once models get strong enough and the underlying repos are widely crawled.
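The "continuously refreshed" idea from the thread can be sketched as a date filter: only score a model on tasks created after its training cutoff. This is a minimal illustration under stated assumptions, not how SWE-rebench or any other benchmark is actually implemented; the `Task` type, its `created` field, and the cutoff date are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    task_id: str
    created: date  # hypothetical: date the underlying issue was opened


def fresh_tasks(tasks: list[Task], training_cutoff: date) -> list[Task]:
    """Keep only tasks created strictly after the model's training cutoff."""
    return [task for task in tasks if task.created > training_cutoff]


if __name__ == "__main__":
    pool = [
        Task("repo-a-101", date(2023, 5, 1)),
        Task("repo-b-202", date(2025, 2, 14)),
        Task("repo-c-303", date(2025, 6, 30)),
    ]
    # Assumed cutoff for illustration only.
    for task in fresh_tasks(pool, date(2025, 1, 1)):
        print(task.task_id)
```

The filter only helps if the cutoff is known and honest, which is part of why commenters also argued for closed test sets: a date check cannot rule out later fine-tuning on newly published tasks.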
Why it matters
This matters beyond one leaderboard. Coding-agent competition now moves fast enough that benchmark hygiene is a product issue, a research issue, and a marketing issue all at once. If models can score well because they have absorbed the tests, issue text, or even gold patches, then inflated scores tell buyers little about production usefulness. LocalLLaMA reacted strongly because many users already suspected this. The OpenAI post simply supplied cleaner evidence and a more formal obituary for the old bragging-rights era of SWE-bench Verified.
Source: OpenAI analysis · r/LocalLLaMA thread