LocalLLaMA Experiment Claims Qwen3.5-35B-A3B Reaches 37.8% on SWE-bench Verified Hard
Original: Qwen3.5-35B-A3B hits 37.8% on SWE-bench Verified Hard — nearly matching Claude Opus 4.6 (40%) with the right verification strategy
What the Reddit experiment reports
A March 4, 2026 post in r/LocalLLaMA shared a coding-agent experiment using Qwen3.5-35B-A3B (described as an MoE setup with 3B active parameters) self-hosted through vLLM. The author evaluated on SWE-bench Verified tasks and argued that agent-loop strategy, not only model size, was the key performance lever. The headline claim is a 37.8% score on SWE-bench Verified Hard (45 tasks), compared with a 22.2% baseline from the same harness.
The post links experiment artifacts and logs in a public repository and includes a simple comparison table across strategies, giving the community enough detail to inspect the method.
Strategy change: verify after each edit
The core intervention is straightforward: after each successful file_edit, the agent receives an instruction to run a short verification step (for example via inline Python or a temporary script) before moving on. In the reported results, “verify-at-last” improved the Hard-task score from 22.2% to 33.3%, while “verify-on-edit” pushed it to 37.8%. On the broader 500-task run, the post reports a 64% baseline and 67% under verify-at-last.
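The verify-on-edit rule can be sketched as a small hook in the agent loop. This is a minimal illustration, not the author's harness: the ToolResult type, the next_messages function, and the wording of VERIFY_PROMPT are all hypothetical, and only the file_edit tool name and the strategy label come from the post.

```python
from dataclasses import dataclass

# Hypothetical wording; the post only says the agent is told to run a short check.
VERIFY_PROMPT = (
    "Before continuing, run a short check of the edit you just made "
    "(for example inline Python or a temporary script) and report the result."
)


@dataclass
class ToolResult:
    """Outcome of one tool call made by the agent (illustrative type)."""
    tool: str    # name of the tool that was called, e.g. "file_edit"
    ok: bool     # whether the call succeeded
    output: str  # raw tool output that is shown to the model


def next_messages(result: ToolResult, strategy: str) -> list[str]:
    """Return the messages appended to the agent context after a tool call.

    Under "verify-on-edit", a successful file_edit is followed by an extra
    instruction asking the model to verify the change before moving on.
    Any other strategy or tool result passes through unchanged.
    """
    messages = [result.output]
    if strategy == "verify-on-edit" and result.tool == "file_edit" and result.ok:
        messages.append(VERIFY_PROMPT)
    return messages


if __name__ == "__main__":
    edit = ToolResult(tool="file_edit", ok=True, output="edited utils.py")
    for message in next_messages(edit, "verify-on-edit"):
        print(message)
```

The hook only adds one message per successful edit, which is why the approach is cheap compared with search-based planners: no extra model calls are scheduled beyond the ones the verification instruction itself prompts.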
The same write-up cites a 40.0% reference for Claude Opus 4.6 on the Hard split, framing the gap as narrower than expected for a smaller active-parameter setup.
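To put these percentages in perspective, it helps to convert them to task counts. The sketch below assumes each Hard score is simply solved tasks divided by 45 and that the Claude Opus 4.6 reference was measured on the same 45 tasks; the post does not state the raw counts, so the values are implied rather than reported.

```python
HARD_TASKS = 45  # size of the Hard split as stated in the post

# Percentages as reported in the post.
reported = {
    "baseline": 22.2,
    "verify-at-last": 33.3,
    "verify-on-edit": 37.8,
    "Claude Opus 4.6 (reference)": 40.0,
}


def implied_solved(percent: float, total: int = HARD_TASKS) -> int:
    """Nearest whole number of solved tasks consistent with a percentage."""
    return round(percent / 100 * total)


# Implied solved-task counts under the assumption described above.
counts = {name: implied_solved(pct) for name, pct in reported.items()}

# One task on a 45-task split is worth this many percentage points.
points_per_task = 100 / HARD_TASKS

if __name__ == "__main__":
    for name, solved in counts.items():
        print(f"{name}: about {solved}/{HARD_TASKS} tasks")
    print(f"one task = {points_per_task:.1f} percentage points")
```

Under those assumptions the figures correspond to 10, 15, 17, and 18 solved tasks, so the gap between verify-on-edit and the cited reference is a single task, and each task moves the score by about 2.2 points. That granularity is worth keeping in mind when comparing strategies on a split this small.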
Community caveats and evaluation risk
Top comments in the thread emphasize a familiar benchmark caveat: potential contamination and benchmark aging. One commenter specifically suggests re-running once newer SWE-rebench-style tasks accumulate, to reduce the chance that results are inflated by leaked training signals. This caution does not invalidate the reported improvement, but it does affect how confidently teams can generalize absolute scores.
Why this matters for coding-agent engineering
The practical takeaway is that lightweight process constraints can deliver substantial gains without adding complex search machinery. The author says MCTS and related tree-search variants underperformed in this setup, while simple sequential verification gave better returns per unit of complexity. For teams building code agents, that suggests a pragmatic priority order: strengthen edit-test discipline and observability first, then evaluate heavier planner architectures only if simpler loops saturate.