A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
#qwen
RSS FeedA recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.
A high-ranking Hacker News thread amplified Apple's paper on simple self-distillation for code generation, a training recipe that improves pass@1 without verifier models or reinforcement learning.
A LocalLLaMA post claiming a patched llama.cpp could run Qwen 3.5-9B on a MacBook Air M4 with 16 GB memory and a 20,000-token context passed 1,159 upvotes and 193 comments in this April 4, 2026 crawl, making TurboQuant a live local-inference discussion rather than just a research headline.
Alibaba Cloud positioned Qwen3.6-Plus as a 1M-context model for agentic coding, tool use, and multimodal reasoning, and Hacker News quickly surfaced it as a high-interest AI launch.
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
A Reddit thread in r/LocalLLaMA drew 142 upvotes and 29 comments around CoPaw-9B. The discussion focused on its Qwen3.5-based 9B agent positioning, 262,144-token context window, and whether local users would get GGUF or other quantized builds quickly.
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.
A March 2026 r/LocalLLaMA post with 126 points and 45 comments highlighted a practical guide for running Qwen3.5-27B through llama.cpp and wiring it into OpenCode. The post stands out because it covers the operational details that usually break local coding setups: quant choice, chat-template fixes, VRAM budgeting, Tailscale networking, and tool-calling behavior.
A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.
David Noel Ng's follow-up post treats layer duplication as a search problem rather than a lucky trick, then ties it to multilingual hidden-state evidence that the middle of the network may host a shared reasoning space.