LocalLLaMA Rallies Around a Qwen3.6 Result That Puts the Scaffold on Trial

Original: Qwen3.6-35B becomes competitive with cloud models when paired with the right agent

LLM · Apr 24, 2026 · By Insights AI (Reddit) · 2 min read

Why the Reddit thread took off

The main thing driving this LocalLLaMA thread was not raw model fandom. It was the feeling that the benchmark stack itself was suddenly under suspicion. The post reports that after earlier experiments moved a smaller Qwen setup from roughly 19.11% to 45.56% by changing the scaffold, the author then paired Qwen3.6-35B-A3B with the same little-coder harness and reached 78.67% on the full 225-exercise Aider Polyglot benchmark. That combination, at Reddit scale, is irresistible: a local model, a coding-agent benchmark, and a result that implies harness design may be load-bearing rather than incidental. At crawl time the thread had 689 points and 167 comments, and one of the highest-signal replies said the 19-to-45-to-78 progression “makes you question every benchmark comparison” that does not control for scaffold choice.

What the linked benchmark document says

The linked benchmark write-up is detailed enough to take seriously. It describes one end-to-end run with Qwen3.6-35B-A3B, quantified as 35B total / 3B active MoE, using a Q4_K_M GGUF around 22.1 GB on disk. The run used llama.cpp on an RTX 5070 Laptop with 8 GB VRAM, with the MoE weights largely offloaded to system RAM. The reported headline is 177 / 225 solved, or 78.67%, which the author says places the agent in the public Aider Polyglot top-10 band. The document also breaks down language-level results: JavaScript at 89.8%, Python at 88.2%, C++ at 84.6%, Java at 76.6%, Go at 74.4%, and Rust at 53.3%.
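As a quick arithmetic check on the figures above (this snippet is just a sanity check of the reported numbers, not part of the benchmark tooling):

```python
# Verify that the reported solve count matches the reported percentage
# on the full 225-exercise Aider Polyglot benchmark.
solved, total = 177, 225
pct = round(100 * solved / total, 2)
print(pct)  # 78.67, matching the write-up's headline figure
```

The earlier 19.11% and 45.56% figures from the smaller-model experiments do not divide evenly into 225, which suggests they may come from a different run size; the write-up does not say, so that is left as reported.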

Why the scaffold is the story

The write-up argues that the gain is not mainly about retry logic or benchmark luck. The biggest delta came from first-attempt solves, which suggests the harness is making the model commit more effectively, not merely cleaning up edge cases after a failure. The earlier little-coder paper linked in the same Reddit post described the scaffold in concrete terms: a write guard to stop destructive full-file rewrites, bounded thinking, explicit workspace discovery, and smaller guidance injections tailored for local models. Community replies picked up exactly that point. Several commenters said the tools and environment are becoming almost as important as the model. Others immediately asked whether little-coder’s design, rather than the underlying Qwen family alone, was the real transferable asset.

Why this matters for local coding agents

The significance is not that one repo has settled the local-coder leaderboard. It is that the thread turns scaffold choice from a footnote into a first-class variable. If a local model on consumer-ish hardware can move into that score band when the harness is adapted to its limits, then many “small model versus frontier model” comparisons are partly comparing mismatched agent assumptions. LocalLLaMA read the post that way. The loudest reaction was not “Qwen wins.” It was “maybe we have been benchmarking the wrapper as much as the model.” That is a much more interesting community signal, and it explains why the discussion immediately branched into questions about pi.dev, terminal-bench follow-ups, and which parts of the scaffold actually deserve the credit.

Sources: little-coder benchmark doc · supporting write-up · Reddit discussion


