LocalLLaMA Experiment Claims Qwen3.5-35B-A3B Reaches 37.8% on SWE-bench Verified Hard

What the Reddit experiment reports

A March 4, 2026 post in r/LocalLLaMA shared a coding-agent experiment using Qwen3.5-35B-A3B (described as an MoE setup with 3B active parameters) self-hosted through vLLM. The author evaluated on SWE-bench Verified tasks and argued that agent-loop strategy, rather than model size alone, was the key performance lever. The headline claim is a 37.8% score on SWE-bench Verified Hard (45 tasks), compared with a 22.2% baseline from the same harness.

The post links experiment artifacts and logs in a public repository and includes a simple comparison table across strategies, giving the community enough detail to inspect the method.

Strategy change: verify after each edit

The core intervention is straightforward: after each successful file_edit, the agent receives an instruction to run a short verification step (for example via inline Python or a temporary script) before moving on. In the reported results, the “verify-at-last” variant improved the Hard split from 22.2% to 33.3%, while “verify-on-edit” reached 37.8%. On the broader 500-task run, the post reports 64% for the baseline and 67% under verify-at-last.
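The verify-on-edit idea can be sketched in a few lines of Python. This is a minimal illustration, not the author's harness: only the tool name `file_edit` comes from the post, while `ToolResult`, `VERIFY_PROMPT` (including its wording), and `with_verify_on_edit` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Hypothetical wording; the post does not quote the exact instruction.
VERIFY_PROMPT = (
    "The edit succeeded. Before moving on, run a short verification step "
    "(for example inline Python or a temporary script) to confirm the change."
)


@dataclass
class ToolResult:
    """Outcome of one tool call in the agent loop (assumed shape)."""
    tool: str    # name of the tool that ran, e.g. "file_edit"
    ok: bool     # whether the tool call succeeded
    output: str  # raw tool output returned to the model


def with_verify_on_edit(result: ToolResult) -> list[str]:
    """Return the messages to append to the agent's context after a tool call.

    The tool output is always passed through. A verification instruction is
    added only after a successful file_edit, mirroring the strategy described
    in the post.
    """
    messages = [result.output]
    if result.tool == "file_edit" and result.ok:
        messages.append(VERIFY_PROMPT)
    return messages


if __name__ == "__main__":
    edit = ToolResult(tool="file_edit", ok=True, output="patched utils.py")
    for message in with_verify_on_edit(edit):
        print(message)
```

Keeping the hook as a pure function over the tool result means it can be added to an existing sequential loop without touching planning or search logic, which is consistent with the post's emphasis on low-complexity changes.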

The same write-up cites a 40.0% reference for Claude Opus 4.6 on the Hard split, framing the gap as narrower than expected for a smaller active-parameter setup.
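To put the Hard-split percentages in concrete terms, the short calculation below converts each reported score into an implied number of solved tasks. It assumes every score is a whole number of tasks out of 45, rounded to one decimal place; the post reports percentages, so the counts are inferred rather than quoted.

```python
HARD_TASKS = 45  # size of the Hard split as reported in the post

# Percentages as reported in the post for the Hard split.
reported = {
    "baseline": 22.2,
    "verify-at-last": 33.3,
    "verify-on-edit": 37.8,
    "Claude Opus 4.6 (reference)": 40.0,
}


def implied_solved(pct: float, total: int = HARD_TASKS) -> int:
    """Nearest whole number of tasks consistent with a reported percentage."""
    return round(pct / 100 * total)


counts = {name: implied_solved(pct) for name, pct in reported.items()}

for name, solved in counts.items():
    # Recompute the percentage to check it matches the reported figure.
    print(f"{name}: {solved}/{HARD_TASKS} = {solved / HARD_TASKS * 100:.1f}%")
```

Under that assumption the scores correspond to 10, 15, 17, and 18 solved tasks, so the reported gap between verify-on-edit and the Claude Opus 4.6 reference amounts to a single task on this split.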

Community caveats and evaluation risk

Top comments in the thread emphasize a familiar benchmark caveat: potential contamination and benchmark aging. One commenter specifically suggests re-running once newer SWE-rebench-style tasks accumulate, to reduce the chance that results are inflated by leaked training signals. This caution does not invalidate the reported improvement, but it does affect how confidently teams can generalize absolute scores.

Why this matters for coding-agent engineering

The practical takeaway is that lightweight process constraints can deliver substantial gains without adding complex search machinery. The author says MCTS and related tree-search variants underperformed in this setup, while simple sequential verification gave better returns for the complexity it added. For teams building code agents, that suggests a pragmatic priority order: strengthen edit-test discipline and observability first, then evaluate heavier planner architectures only if simpler loops saturate.

Sources: Reddit thread · Experiment repository

Related Articles

SWE-rebench January 2026 Snapshot Highlights a Tight Race in Coding Agents

Hacker News Debates Whether LLM Coding Progress Has Stalled on Maintainer Merge Rates

LocalLLaMA Calls SWE-bench Verified “Benchmaxxed” as Benchmark Trust Cracks