Qwen3.6 on an M5 Max Made r/LocalLLaMA Talk About Keeping Code Local
Original: "I'm running qwen3.6-35b-a3b with 8 bit quant and 64k context thru OpenCode on my mbp m5 max 128gb and it's as good as claude"
This r/LocalLLaMA post was closer to a field report than a benchmark, which is why it landed. The author said they were running Qwen3.6-35B-A3B with 8-bit quantization and a 64k context window through OpenCode on a MacBook Pro M5 Max with 128GB of memory. They also admitted it was a “trust me bro” post, but the details gave the thread something concrete to test against.
The workload was not a toy prompt. The author said the model handled long research tasks with many tool calls, including investigating why R8 was breaking serialization across an Android app. They described fast responses, useful answers, and enough confidence to consider it a daily driver after using Kimi k2.5 through OpenCode zen. The line that carried the community energy was about not sending an entire codebase to random providers and hoping the trust model holds.
The comments immediately added useful friction. One user said that on an RTX 5090, the speed made the overall experience feel unmatched by cloud models. Another argued that context is cheap on Qwen and that 256k is reachable. Others pushed back: it may be quite good, but not actually Claude; and 64k context may be low for agentic coding once a tool loop starts accumulating state.
The community discussion noted that the real signal is not a formal win over closed models; it is a threshold signal. Local inference has often been framed as possible but inconvenient. Posts like this suggest that, for some coding workflows, a 30B-to-40B-class sparse model on high-memory consumer hardware can feel operational enough to change where developers are willing to run agents.
The caveats are the story. Hardware, quantization, KV cache settings, context length, editor integration, and task shape all matter. The thread's value is not one claim of parity. It is a practical checklist for evaluating local coding agents: privacy, latency, context cost, tool-call stability, and whether the model can stay useful across real project state.
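The "context cost" item in that checklist is easy to make concrete with back-of-envelope arithmetic: KV cache memory grows linearly with context length, which is why 64k vs. 256k matters on a fixed 128GB machine. The architecture numbers below (layer count, KV heads, head dimension) are illustrative assumptions, not the published Qwen3.6-35B-A3B configuration:

```python
# Back-of-envelope KV cache sizing for a local coding agent.
# Layer/head/dim values are hypothetical, chosen only to show the arithmetic.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Memory for keys + values across all layers at a given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * context_len

# Hypothetical config: 48 layers, 8 KV heads (GQA), head_dim 128, fp16 cache.
cache_64k = kv_cache_bytes(48, 8, 128, 64 * 1024)
print(f"64k context KV cache: {cache_64k / 2**30:.1f} GiB")    # → 12.0 GiB
cache_256k = kv_cache_bytes(48, 8, 128, 256 * 1024)
print(f"256k context KV cache: {cache_256k / 2**30:.1f} GiB")  # → 48.0 GiB
```

Under these assumed dimensions, quadrupling the context quadruples the cache, which is why commenters treat "context is cheap" claims as hardware-dependent rather than universal.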
Related Articles
r/LocalLLaMA liked this comparison because it replaces reputation and anecdote with an explicit distribution-based yardstick, turning a messy GGUF choice into a measurable tradeoff. The post ranks community Qwen3.5-9B GGUF quants by mean KLD against a BF16 baseline, with Q8_0 variants leading on fidelity and several IQ4/Q5 options standing out on size-to-drift trade-offs; commenters pushed for better visual encoding, Gemma 4 runs, Thireus quants, and long-context testing.
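The mean-KLD yardstick in that post reduces to a simple procedure: for each evaluation token, compare the quantized model's next-token distribution to the BF16 baseline's, and average the KL divergence over tokens. A minimal sketch with made-up logits (the post's actual evaluation corpus and tooling are not reproduced here):

```python
import math

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between two next-token distributions given as raw logits."""
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(baseline_logits_per_token, quant_logits_per_token):
    """Average KL(baseline || quant) over evaluation tokens; lower is better."""
    klds = [kl_divergence(p, q)
            for p, q in zip(baseline_logits_per_token, quant_logits_per_token)]
    return sum(klds) / len(klds)

# Identical distributions give zero drift; a perturbed quant gives positive drift.
base = [[2.0, 1.0, 0.1], [0.5, 1.5, -0.2]]
quant = [[1.9, 1.1, 0.0], [0.4, 1.6, -0.1]]
print(mean_kld(base, base))   # 0.0
print(mean_kld(base, quant))  # small positive value
```

The appeal of this metric over perplexity alone is that it measures drift from the full-precision model directly, token by token, rather than against the text corpus.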
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.
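The mechanism that post describes, keeping recently routed experts resident in fast memory, is essentially recency-based cache placement. The sketch below illustrates the idea with a toy LRU hot set; it is not the llama.cpp fork's actual implementation, and the class and method names are hypothetical:

```python
from collections import OrderedDict

class HotExpertCache:
    """Toy sketch of recency-based expert placement for MoE offload.
    Tracks routing events and keeps the most recently used experts in a
    fixed-size 'fast' tier (standing in for VRAM); the rest stay offloaded."""

    def __init__(self, fast_slots):
        self.fast_slots = fast_slots
        self.recency = OrderedDict()  # expert_id -> None, most recent last

    def route(self, expert_id):
        """Record a routing event; return True if the expert was already hot."""
        hit = expert_id in self.fast_set()
        self.recency.pop(expert_id, None)  # refresh recency
        self.recency[expert_id] = None
        return hit

    def fast_set(self):
        """The most recently routed experts occupy the fast slots."""
        return set(list(self.recency.keys())[-self.fast_slots:])

cache = HotExpertCache(fast_slots=2)
for expert in [0, 1, 0, 2, 0]:
    cache.route(expert)
print(sorted(cache.fast_set()))  # [0, 2] — the two most recently used experts
```

The reported speedup comes from exactly this skew: if routing is bursty, a small hot set of experts serves most tokens, so pinning it in VRAM beats offloading whole layers at the same memory budget.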