LocalLLaMA Rallies Around a Qwen3.6 Result That Puts the Scaffold on Trial
Original: Qwen3.6-35B becomes competitive with cloud models when paired with the right agent
Why the Reddit thread took off
The main thing driving this LocalLLaMA thread was not raw model fandom. It was the feeling that the benchmark stack itself was suddenly under suspicion. The post reports that after earlier experiments moved a smaller Qwen setup from roughly 19.11% to 45.56% by changing the scaffold, the author then paired Qwen3.6-35B-A3B with the same little-coder harness and reached 78.67% on the full 225-exercise Aider Polyglot benchmark. That combination, at Reddit scale, is irresistible: a local model, a coding-agent benchmark, and a result that implies harness design may be load-bearing rather than incidental. At crawl time the thread had 689 points and 167 comments, and one of the highest-signal replies said the 19-to-45-to-78 progression “makes you question every benchmark comparison” that does not control for scaffold choice.
What the linked benchmark document says
The linked benchmark write-up is detailed enough to take seriously. It describes one end-to-end run with Qwen3.6-35B-A3B, a mixture-of-experts model with roughly 35B total and 3B active parameters, using a Q4_K_M GGUF of around 22.1 GB on disk. The run used llama.cpp on an RTX 5070 Laptop GPU with 8 GB of VRAM, with the MoE weights largely offloaded to system RAM. The reported headline is 177 / 225 solved, or 78.67%, which the author says places the agent in the public Aider Polyglot top-10 band. The document also breaks down language-level results: JavaScript at 89.8%, Python at 88.2%, C++ at 84.6%, Java at 76.6%, Go at 74.4%, and Rust at 53.3%.
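The write-up does not reproduce the exact launch command, but this kind of run is usually done with llama.cpp's tensor-override flag, which keeps the per-layer attention weights on the GPU while forcing the bulky MoE expert tensors into system RAM. The sketch below shows the common pattern; the model filename and context size are assumptions for illustration, not values taken from the benchmark doc:

```shell
# Sketch of a llama.cpp launch that fits a ~22 GB MoE GGUF next to 8 GB of VRAM.
# -ngl 99 offloads every layer to the GPU, while the -ot (--override-tensor)
# pattern forces the MoE expert tensors (ffn_*_exps) back into system RAM.
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -ot '.*ffn_.*_exps.*=CPU' \
  -c 16384
```

Because only ~3B parameters are active per token, the expert weights sitting in system RAM are touched sparsely, which is what makes this split workable on a laptop GPU.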
Why the scaffold is the story
The write-up argues that the gain is not mainly about retry logic or benchmark luck. The biggest delta came from first-attempt solves, which suggests the harness is making the model commit more effectively, not merely cleaning up edge cases after a failure. The earlier little-coder paper linked in the same Reddit post described the scaffold in concrete terms: a write guard to stop destructive full-file rewrites, bounded thinking, explicit workspace discovery, and smaller guidance injections tailored for local models. Community replies picked up exactly that point. Several commenters said the tools and environment are becoming almost as important as the model. Others immediately asked whether little-coder’s design, rather than the underlying Qwen family alone, was the real transferable asset.
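little-coder's actual write guard is not shown in the thread, so the following is only a minimal Python sketch of the idea it describes: before applying a model-proposed full-file write, diff it against the current file and refuse edits that would wipe out most of the existing content. The function name and the 0.5 threshold are invented for illustration.

```python
import difflib

def write_guard(original: str, proposed: str, max_deleted_ratio: float = 0.5) -> bool:
    """Return True if a model-proposed full-file write should be applied.

    Blocks "destructive" rewrites: proposals that preserve too little of the
    existing file, which is a common failure mode for small local models.
    """
    old_lines = original.splitlines()
    if not old_lines:
        return True  # creating a new file is always allowed
    new_lines = proposed.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    deleted_ratio = 1.0 - kept / len(old_lines)
    return deleted_ratio <= max_deleted_ratio
```

A guard like this sits between the model and the workspace: additions and targeted edits pass through, while a hallucinated wholesale rewrite is bounced back to the model instead of destroying the file.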
Why this matters for local coding agents
The significance is not that one repo has settled the local-coder leaderboard. It is that the thread turns scaffold choice from a footnote into a first-class variable. If a local model on consumer-ish hardware can move into that score band when the harness is adapted to its limits, then many “small model versus frontier model” comparisons are partly comparing mismatched agent assumptions. LocalLLaMA read the post that way. The loudest reaction was not “Qwen wins.” It was “maybe we have been benchmarking the wrapper as much as the model.” That is a much more interesting community signal, and it explains why the discussion immediately branched into questions about pi.dev, terminal-bench follow-ups, and which parts of the scaffold actually deserve the credit.
Sources: little-coder benchmark doc · supporting write-up · Reddit discussion
Related Articles
Alibaba’s April 22 Qwen3.6-Max-Preview post claims top scores across six coding benchmarks and clear gains over Qwen3.6-Plus. The caveat is just as important: this is a hosted proprietary preview, not a new open-weight Qwen release.
HN upvoted the joke because it exposed a real discomfort: one vivid SVG prompt can make a small local model look better than a flagship model, but nobody agrees what that proves.
The LocalLLaMA thread cared less about a release headline and more about which Qwen3.6 GGUF quant actually works. Unsloth’s benchmark post pushed the discussion into KLD, disk size, CUDA 13.2 failures, and the messy details that decide local inference quality.