LocalLLaMA Experiment Claims Qwen3.5-35B-A3B Reaches 37.8% on SWE-bench Verified Hard
Original: "Qwen3.5-35B-A3B hits 37.8% on SWE-bench Verified Hard — nearly matching Claude Opus 4.6 (40%) with the right verification strategy"
What the Reddit experiment reports
A March 4, 2026 post in r/LocalLLaMA shared a coding-agent experiment using Qwen3.5-35B-A3B (described as a MoE setup with 3B active parameters) self-hosted through vLLM. The author evaluated on SWE-bench Verified tasks and argued that agent-loop strategy, not only model size, was the key performance lever. The headline claim is a 37.8% score on SWE-bench Verified Hard (45 tasks), compared with a 22.2% baseline from the same harness.
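For context on the self-hosting setup: vLLM typically serves models behind an OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below builds such a request using only the standard library; the endpoint URL and the model identifier are assumptions for illustration, not details taken from the post.

```python
import json
import urllib.request

# Assumed local vLLM endpoint and model id; adjust to your deployment.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "Qwen/Qwen3.5-35B-A3B"  # hypothetical identifier

def build_chat_request(prompt, temperature=0.2):
    """Build an OpenAI-compatible chat-completion request for a vLLM server."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Fix the failing test in utils.py")
print(req.full_url)
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would return the usual OpenAI-style JSON response; the agent harness wraps calls like this in its tool loop.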
The post links experiment artifacts and logs in a public repository and includes a simple comparison table across strategies, giving the community enough detail to inspect the method.
Strategy change: verify after each edit
The core intervention is straightforward: after each successful file_edit, the agent receives an instruction to run a short verification step (for example via inline Python or a temporary script) before moving on. In the reported results, "verify-at-last" improved Hard tasks from 22.2% to 33.3%, while "verify-on-edit" pushed them to 37.8%. On the broader 500-task run, the post reports 64% baseline and 67% under verify-at-last.
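The verify-on-edit loop can be sketched in a few lines. This is a minimal illustration, not the author's harness: the function names (`apply_edit`, `verify`, `edit_with_verification`) are invented here, and the verification step is a simple compile check standing in for whatever short test the agent would run.

```python
import os
import subprocess
import sys
import tempfile

def apply_edit(path, new_text):
    """Stand-in for the agent's file_edit tool: write new file contents."""
    with open(path, "w") as f:
        f.write(new_text)

def verify(path):
    """Short verification step run immediately after an edit.
    Here: compile the edited file in a subprocess and report success."""
    result = subprocess.run(
        [sys.executable, "-m", "py_compile", path],
        capture_output=True,
    )
    return result.returncode == 0

def edit_with_verification(path, candidate_text):
    """Verify-on-edit: apply the edit, then check it before moving on.
    On failure, the agent would revise rather than proceed."""
    apply_edit(path, candidate_text)
    return verify(path)

# Demo: a syntactically valid edit passes, a broken one fails.
tmp = tempfile.NamedTemporaryFile(suffix=".py", delete=False)
tmp.close()
ok = edit_with_verification(tmp.name, "def f():\n    return 1\n")
bad = edit_with_verification(tmp.name, "def f(:\n")
os.unlink(tmp.name)
print(ok, bad)
```

The design point is that verification is interleaved with editing rather than deferred to the end, which is what distinguishes "verify-on-edit" from the "verify-at-last" baseline in the post.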
The same write-up cites a 40.0% reference for Claude Opus 4.6 on the Hard split, framing the gap as narrower than expected for a smaller active-parameter setup.
Community caveats and evaluation risk
Top comments in the thread emphasize a familiar benchmark caveat: potential contamination and benchmark aging. One commenter specifically suggests re-running once newer SWE-rebench-style tasks accumulate, to reduce the chance that results are inflated by leaked training signals. This caution does not invalidate the reported improvement, but it does affect how confidently teams can generalize absolute scores.
Why this matters for coding-agent engineering
The practical takeaway is that lightweight process constraints can deliver substantial gains without adding complex search machinery. The author says MCTS and related tree-search variants underperformed in this setup, while simple sequential verification gave better returns per unit complexity. For teams building code agents, that suggests a pragmatic priority order: strengthen edit-test discipline and observability first, then evaluate heavier planner architectures only if simpler loops saturate.
Related Articles
A LocalLLaMA discussion of SWE-rebench January runs reports close top-tier results, with Claude Code leading pass@1 and pass@5 while open models narrow the gap.
OmniCoder-9B packages agent-style coding behavior into a smaller open model by training on more than 425,000 curated trajectories from real tool-using workflows.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.