LocalLLaMA Experiment Claims Qwen3.5-35B-A3B Reaches 37.8% on SWE-bench Verified Hard

Original: Qwen3.5-35B-A3B hits 37.8% on SWE-bench Verified Hard — nearly matching Claude Opus 4.6 (40%) with the right verification strategy

LLM · Mar 4, 2026 · By Insights AI (Reddit)

What the Reddit experiment reports

A March 4, 2026 post in r/LocalLLaMA shared a coding-agent experiment using Qwen3.5-35B-A3B (described as a MoE setup with 3B active parameters) self-hosted through vLLM. The author evaluated on SWE-bench Verified tasks and argued that agent-loop strategy, not only model size, was the key performance lever. The headline claim is a 37.8% score on SWE-bench Verified Hard (45 tasks), compared with a 22.2% baseline from the same harness.

The post links experiment artifacts and logs in a public repository and includes a simple comparison table across strategies, giving the community enough detail to inspect the method.

Strategy change: verify after each edit

The core intervention is straightforward: after each successful file_edit, the agent receives an instruction to run a short verification step (for example via inline Python or a temporary script) before moving on. In the reported results, “verify-at-last” improved Hard tasks from 22.2% to 33.3%, while “verify-on-edit” pushed them to 37.8%. On the broader 500-task run, the post reports a 64% baseline and 67% under verify-at-last.
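The post's harness isn't reproduced here, but the per-edit verification step it describes can be sketched as a helper that writes a short throwaway check, runs it, and reports whether it exited cleanly (the function name, timeout, and return shape are illustrative assumptions, not taken from the post):

```python
import pathlib
import subprocess
import sys
import tempfile
import textwrap


def run_verification(snippet: str, timeout: int = 30) -> tuple[bool, str]:
    """Write a short Python check to a temporary script, run it, and
    return (passed, stderr). An agent loop would call this immediately
    after each successful file_edit."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(snippet))
        tmp = f.name
    try:
        result = subprocess.run(
            [sys.executable, tmp],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.returncode == 0, result.stderr
    finally:
        # Clean up the throwaway script regardless of the outcome.
        pathlib.Path(tmp).unlink(missing_ok=True)
```

On a failing check, the loop would feed the captured stderr back into the agent's context before the next edit; acting on that signal per edit, rather than once at the end, is what separates verify-on-edit from verify-at-last.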

The same write-up cites a 40.0% reference for Claude Opus 4.6 on the Hard split, framing the gap as narrower than expected for a smaller active-parameter setup.

Community caveats and evaluation risk

Top comments in the thread emphasize a familiar benchmark caveat: potential contamination and benchmark aging. One commenter specifically suggests re-running once newer SWE-rebench-style tasks accumulate, to reduce the chance that results are inflated by leaked training signals. This caution does not invalidate the reported improvement, but it does affect how confidently teams can generalize absolute scores.

Why this matters for coding-agent engineering

The practical takeaway is that lightweight process constraints can deliver substantial gains without adding complex search machinery. The author says MCTS and related tree-search variants underperformed in this setup, while simple sequential verification gave better returns per unit complexity. For teams building code agents, that suggests a pragmatic priority order: strengthen edit-test discipline and observability first, then evaluate heavier planner architectures only if simpler loops saturate.
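As a toy illustration of why this is "process constraint, not search machinery": the difference between the two strategies is only where the verification call sits in the loop. The names and return convention below are hypothetical, not the post's actual harness:

```python
from typing import Callable, Iterable


def run_edit_loop(
    edits: Iterable[Callable[[], None]],
    verify: Callable[[], bool],
    strategy: str = "verify_on_edit",
) -> list[int]:
    """Apply edits, checking after each one (verify_on_edit) or only
    once at the end (verify_at_last). Returns the indices of edits whose
    check failed; a failed final check is reported as -1."""
    failed: list[int] = []
    for i, edit in enumerate(edits):
        edit()
        if strategy == "verify_on_edit" and not verify():
            failed.append(i)  # a real agent would feed the error back here
    if strategy == "verify_at_last" and not verify():
        failed.append(-1)
    return failed
```

Verify-on-edit localizes a failure to the edit that caused it, while verify-at-last only reports that something in the whole sequence broke, which is consistent with the post's claim that the cheaper-per-step discipline pays off before any tree search does.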

Reddit thread · Experiment repository


© 2026 Insights. All rights reserved.