LocalLLaMA Calls SWE-bench Verified “Benchmaxxed” as Benchmark Trust Cracks
Original post: “Confirmed: SWE Bench is now a benchmaxxed benchmark”
The thread’s mood was not surprise but recognition
LocalLLaMA did not treat this post as a shocking revelation. The top responses read more like a grim “finally, someone said it plainly.” The Reddit title called SWE-bench Verified “benchmaxxed,” and the comments immediately linked it to Goodhart’s law: once a public benchmark becomes the target, it stops being a clean measure. That mood matters, because it shows how much trust in headline coding-benchmark numbers has already eroded among the most benchmark-literate users.
The catalyst was an OpenAI analysis arguing that SWE-bench Verified no longer cleanly measures frontier coding capability and should be replaced, at least for reporting, by SWE-bench Pro.
What the OpenAI analysis said
OpenAI’s post gives two main reasons. First, the benchmark’s remaining failures are no longer obviously model failures. The company wrote that performance had risen from 74.9% to 80.9% in six months, then said an audit of a hard subset found major test-design or task-description problems in 59.4% of the 138 reviewed cases. Second, OpenAI argued contamination is now too visible to ignore. In its examples, frontier models could reproduce the gold patch or task-specific details that should not have been inferable from the prompt alone, suggesting exposure to benchmark material during training.
That combination is ugly for a public leaderboard: some failures reject correct work, while some successes may reflect dataset exposure more than real software-engineering progress.
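For readers who want a concrete sense of what a contamination probe looks like, here is a minimal sketch in that spirit. It is not OpenAI’s methodology: generate_patch is a hypothetical hook for whatever model is under test, and the crude text-similarity metric is a stand-in for whatever check an auditor would actually use.

```python
# Illustrative memorization probe, not OpenAI's actual methodology.
# generate_patch() is a hypothetical hook for whatever model is under test.
from difflib import SequenceMatcher


def generate_patch(problem_statement: str) -> str:
    """Ask the model for a unified diff given ONLY the issue text,
    with no repository checkout attached. Plug in your model call here."""
    raise NotImplementedError


def memorization_score(problem_statement: str, gold_patch: str) -> float:
    """Similarity between the model's blind guess and the gold patch.

    A near-verbatim match on details the issue text never mentions is a
    red flag for training-data exposure; a low score proves nothing."""
    candidate = generate_patch(problem_statement)
    return SequenceMatcher(None, candidate, gold_patch).ratio()
```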
What the subreddit added
The most upvoted Reddit comments did not spend much time defending Verified. One called this the inevitable endpoint for any public benchmark. Another reduced the whole problem to Goodhart’s law in one line. Others said benchmarks need to stay closed or continuously refreshed if model developers are training on the same public ecosystem that generated the tasks. A few commenters pointed to SWE-rebench precisely because it rotates problems, while some noted lingering political skepticism about why vendors abandon one benchmark and adopt another. But even with that skepticism, the broader consensus was clear: static public benchmarks decay fast once models get strong enough and the underlying repos are widely crawled.
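The “continuously refreshed” idea the commenters describe is easy to sketch. Assuming each task carries an ISO-8601 created_at timestamp, as SWE-bench instances do, a rotating evaluation simply drops anything that predates the model’s stated training cutoff; SWE-rebench’s actual pipeline is more involved than this.

```python
# Minimal sketch of a rolling holdout: keep only tasks whose source issues
# postdate the model's training cutoff, so each round draws from material
# the model could not have seen. Assumes an ISO-8601 created_at field.
from datetime import datetime, timezone


def fresh_tasks(tasks: list[dict], training_cutoff: datetime) -> list[dict]:
    fresh = []
    for task in tasks:
        created = datetime.fromisoformat(task["created_at"].replace("Z", "+00:00"))
        if created > training_cutoff:
            fresh.append(task)
    return fresh


# Example: evaluate only on issues filed after an assumed October 2024 cutoff.
# holdout = fresh_tasks(all_tasks, datetime(2024, 10, 1, tzinfo=timezone.utc))
```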
Why it matters
This matters beyond one leaderboard. Coding-agent competition now moves fast enough that benchmark hygiene is a product issue, a research issue, and a marketing issue all at once. If models can score well because they have absorbed the tests, issue text, or even gold patches, then score inflation stops telling buyers much about production usefulness. LocalLLaMA reacted strongly because many users already suspected this. The OpenAI post simply supplied cleaner evidence and a more formal obituary for the old bragging-rights era of SWE-bench Verified.
Source: OpenAI analysis · r/LocalLLaMA thread
Related Articles
Alibaba’s April 22 Qwen3.6-Max-Preview post claims top scores across six coding benchmarks and clear gains over Qwen3.6-Plus. The caveat is just as important: this is a hosted proprietary preview, not a new open-weight Qwen release.
What energized LocalLLaMA was not just another Qwen score jump. It was the claim that changing the agent scaffold moved the same family of local models from 19% to 45% to 78.7%, making benchmark comparisons feel less settled than many assumed.
Why it matters: public coding benchmarks are getting less useful at the frontier, so a fresh product-side score can move developer attention fast. Cursor says GPT-5.5 is now its top model on CursorBench at 72.8% and is discounting usage by 50% through May 2.