LocalLLaMA Experiment Claims Qwen3.5-35B-A3B Reaches 37.8% on SWE-bench Verified Hard
Original post: “Qwen3.5-35B-A3B hits 37.8% on SWE-bench Verified Hard — nearly matching Claude Opus 4.6 (40%) with the right verification strategy”
What the Reddit experiment reports
A March 4, 2026 post in r/LocalLLaMA shared a coding-agent experiment using Qwen3.5-35B-A3B (described as a MoE setup with 3B active parameters) self-hosted through vLLM. The author evaluated on SWE-bench Verified tasks and argued that agent-loop strategy, not only model size, was the key performance lever. The headline claim is a 37.8% score on SWE-bench Verified Hard (45 tasks), compared with a 22.2% baseline from the same harness.
The post links experiment artifacts and logs in a public repository and includes a simple comparison table across strategies, giving the community enough detail to inspect the method.
Strategy change: verify after each edit
The core intervention is straightforward: after each successful file_edit, the agent receives an instruction to run a short verification step (for example, via inline Python or a temporary script) before moving on. In the reported results, “verify-at-last” lifted the Hard split from 22.2% to 33.3%, while “verify-on-edit” pushed it to 37.8%. On the broader 500-task run, the post reports a 64% baseline and 67% under verify-at-last.
The same write-up cites a 40.0% reference for Claude Opus 4.6 on the Hard split, framing the gap as narrower than expected for a smaller active-parameter setup.
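The verify-on-edit loop described above can be sketched roughly as follows. This is a minimal illustration, not the post’s actual harness: the helper names (`llm_step`, `apply_edit`), the action shape, and the throwaway-script verification are all assumptions for the sake of the example.

```python
import subprocess
import sys
import tempfile

def run_verification(snippet: str, timeout: int = 30) -> bool:
    """Run a short throwaway check (e.g. import the edited module and
    exercise the changed function) and report pass/fail.
    Hypothetical stand-in for the post's 'inline Python or temporary
    script' verification step."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(snippet)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def agent_loop(task, llm_step, apply_edit, max_turns: int = 20):
    """Sketch of 'verify-on-edit': after every successful file edit,
    force a verification step before the agent may continue."""
    history = [task]
    for _ in range(max_turns):
        action = llm_step(history)  # model proposes the next action
        if action["kind"] == "file_edit":
            if apply_edit(action):
                # Core intervention: check the edit before moving on.
                passed = run_verification(action["check_snippet"])
                history.append(
                    "verification passed" if passed
                    else "verification FAILED; revise the last edit"
                )
        elif action["kind"] == "submit":
            return history
        history.append(action.get("observation", ""))
    return history
```

In a “verify-at-last” variant, the `run_verification` call would instead move outside the loop, firing once before `submit`; the post’s numbers suggest the per-edit placement is what recovers most of the gain.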
Community caveats and evaluation risk
Top comments in the thread emphasize a familiar benchmark caveat: potential contamination and benchmark aging. One commenter specifically suggests re-running once newer SWE-rebench-style tasks accumulate, to reduce the chance that results are inflated by leaked training signals. This caution does not invalidate the reported improvement, but it does affect how confidently teams can generalize absolute scores.
Why this matters for coding-agent engineering
The practical takeaway is that lightweight process constraints can deliver substantial gains without adding complex search machinery. The author says MCTS and related tree-search variants underperformed in this setup, while simple sequential verification gave better returns per unit complexity. For teams building code agents, that suggests a pragmatic priority order: strengthen edit-test discipline and observability first, then evaluate heavier planner architectures only if simpler loops saturate.
Related Articles
LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing contamination and flawed tests laid out in numbers stark enough that the old bragging rights no longer looked credible.
r/LocalLLaMA pushed this post up because the “trust me bro” report had real operating conditions: 8-bit quantization, 64k context, OpenCode, and Android debugging.
Alibaba’s April 22 Qwen3.6-Max-Preview post claims top scores across six coding benchmarks and clear gains over Qwen3.6-Plus. The caveat is just as important: this is a hosted proprietary preview, not a new open-weight Qwen release.