LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing the contamination and flawed tests laid out in hard numbers, at a scale that makes the old bragging rights look shaky.
#swe-bench
HN piled in because this was bigger than another benchmark refresh. OpenAI said SWE-bench Verified is no longer a trustworthy frontier coding signal, and the thread immediately shifted to contamination, saturated leaderboards, and whether public coding evals can stay clean at all.
Why it matters: Alibaba is putting a small-active-parameter multimodal coding model into open weights rather than keeping it API-only. The tweet says Qwen3.6-35B-A3B has 35B total parameters, 3B active parameters, and an Apache 2.0 license; the blog reports 73.4 on SWE-bench Verified and 51.5 on Terminal-Bench 2.0.
Hacker News picked up Z.ai's GLM-5.1 as a model aimed less at one-shot wins and more at sustained agentic work. Z.ai reports 58.4 on SWE-Bench Pro, 42.7 on NL2Repo, 66.5 on Terminal Bench 2.0, and long-horizon runs that keep improving through hundreds of iterations and thousands of tool calls.
A Hacker News thread amplified a March 12 analysis arguing that LLM coding progress looks much weaker when measured by maintainer merge decisions rather than test-passing SWE-bench scores.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.
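The post gives no code, but the idea is easy to sketch. Below is a minimal, hypothetical version of a verify-after-every-edit loop in Python: `propose_edit` and `apply_edit` stand in for the model call and patch application (names assumed, not from the post), and the checker shells out to pytest as a placeholder for whatever test command the real harness runs.

```python
import subprocess

def run_checks(repo_dir: str) -> tuple[bool, str]:
    """Run the repo's checks and return (passed, output).

    pytest is a placeholder; a real harness would invoke the
    project's own test runner (cargo test, npm test, ...).
    """
    proc = subprocess.run(
        ["pytest", "-x", "-q"],
        cwd=repo_dir, capture_output=True, text=True, timeout=600,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def solve(task: str, repo_dir: str, propose_edit, apply_edit,
          max_steps: int = 20) -> bool:
    """Verify-after-every-edit loop: run the checks after each
    single edit and feed failures straight back to the model,
    instead of batching edits and checking once at the end."""
    feedback = ""
    for _ in range(max_steps):
        edit = propose_edit(task, feedback)  # model call (assumed interface)
        apply_edit(repo_dir, edit)
        ok, output = run_checks(repo_dir)
        if ok:
            return True
        feedback = output[-4000:]  # truncate logs for the next prompt
    return False
```

The design choice doing the work is checking after every single edit, so the model sees failures while the faulty change is still the most recent context.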
A trending Reddit post in r/singularity points to OpenAI's statement that it no longer evaluates on SWE-bench Verified, citing at least 16.4% flawed test cases. The announcement is a prompt to weigh coding-model benchmark scores more skeptically when making production decisions.
A Hacker News discussion highlights arXiv:2602.11988, which finds that repository context files like AGENTS.md often reduced coding-agent task success while increasing inference cost by more than 20%.
A LocalLLaMA discussion of SWE-rebench January runs reports a tight top tier, with Claude Code leading on both pass@1 and pass@5 while open models narrow the gap.
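For readers skimming the leaderboard, pass@k is the probability that at least one of k sampled attempts solves a task. The standard unbiased estimator (Chen et al., 2021) is sketched below; whether SWE-rebench computes its numbers exactly this way is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from n total samples, c of them
    correct: 1 minus the chance that all k drawn samples fail."""
    if n - c < k:  # fewer failures than k: some draw must be correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 3 correct ->
# pass@1 = 0.3, pass@5 ~= 0.917
```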