The weak point in model leaderboards may be the tasks, not only the models. A new arXiv paper reports critical issues in more than 25.7% of evaluated benchmark tasks and shows ranking shifts after filtering flawed items.
#swe-bench
RSS FeedHN treated OpenAI's post less as benchmark housekeeping and more as an obituary for a famous coding leaderboard. The thread cared far more about flawed tests and contamination than about who happened to top the chart first.
LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing contamination and flawed tests laid out in numbers big enough that the old bragging rights no longer looked stable.
Why it matters: Alibaba is putting a small-active-parameter multimodal coding model into open weights rather than keeping it API-only. The tweet says Qwen3.6-35B-A3B has 35B total parameters, 3B active parameters, and an Apache 2.0 license; the blog reports 73.4 on SWE-bench Verified and 51.5 on Terminal-Bench 2.0.
Hacker News picked up Z.ai's GLM-5.1 as a model aimed less at one-shot wins and more at sustained agentic work. Z.ai reports 58.4 on SWE-Bench Pro, 42.7 on NL2Repo, 66.5 on Terminal Bench 2.0, and long-horizon runs that keep improving through hundreds of iterations and thousands of tool calls.
A Hacker News thread amplified a March 12 analysis arguing that LLM coding progress looks much weaker when measured by maintainer merge decisions rather than test-passing SWE-bench scores.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.
A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing at least 16.4% flawed test cases. The announcement reframes how coding-model benchmark scores should be interpreted in production decision-making.
A Hacker News discussion highlights arXiv:2602.11988, which finds that repository context files like AGENTS.md often reduced coding-agent task success while increasing inference cost by more than 20%.
A LocalLLaMA discussion of SWE-rebench January runs reports close top-tier results, with Claude Code leading pass@1 and pass@5 while open models narrow the gap.