A Qwen3.6 tuning post made --n-cpu-moe the LocalLLaMA knob of the day

An r/LocalLLaMA post put Qwen3.6-35B-A3B tuning into the form the community likes best: hardware, flags, and tokens per second. The author used an RTX 5070 Ti with 16GB VRAM, a Ryzen 7 9800X3D, 32GB DDR5, llama.cpp b8829, and unsloth/Qwen3.6-35B-A3B-GGUF at UD-Q4_K_M. The headline number was roughly 79 t/s of generation at 128K context.

The finding centered on --cpu-moe versus --n-cpu-moe N. According to the post, the common --cpu-moe approach pushes all MoE experts to CPU and leaves much of the GPU underused, while --n-cpu-moe N keeps only the first N layers' experts on CPU and lets the rest use VRAM. With --cpu-moe, the baseline was 51.2 generation t/s, 87.9 prompt t/s, and 3.5GB VRAM use. With --n-cpu-moe 20, the result rose to 78.7 generation t/s, 100.6 prompt t/s, and 12.7GB VRAM use.
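As a rough mental model of the two flags, the sketch below marks where each layer's expert weights would live. It assumes --n-cpu-moe N applies to the first N layers, which is how llama.cpp describes the option. The 40-layer count and the function name are illustrative assumptions, not figures from the post.

```python
def expert_placement(n_layers: int, n_cpu_moe: int) -> list[str]:
    """Return 'cpu' or 'gpu' for each layer's MoE expert weights.

    Conceptual model only: the first n_cpu_moe layers keep their
    experts in system RAM, the remaining layers keep them in VRAM.
    """
    n = max(0, min(n_cpu_moe, n_layers))  # clamp to a valid range
    return ["cpu" if i < n else "gpu" for i in range(n_layers)]


N_LAYERS = 40  # assumed layer count, for illustration only

all_on_cpu = expert_placement(N_LAYERS, N_LAYERS)  # behaves like --cpu-moe
partial = expert_placement(N_LAYERS, 20)           # like --n-cpu-moe 20

print("cpu-moe:      ", all_on_cpu.count("cpu"), "layers on CPU")
print("n-cpu-moe 20: ", partial.count("cpu"), "on CPU,",
      partial.count("gpu"), "on GPU")
```

Lowering N moves more expert weights into VRAM, which is consistent with the jump from 3.5GB to 12.7GB reported in the post.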

Adding -np 1 (a single server slot) and a 128K context produced 79.3 generation t/s, 135.8 prompt t/s, and 13.2GB VRAM use in the author’s run. The post summarized the gain as about 54% in generation speed over the naive --cpu-moe path. That is why the thread became less about Qwen hype and more about how sparse MoE layers are placed across CPU and GPU.
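The percentages can be checked directly from the reported figures. The short calculation below uses only the numbers quoted above. The helper name and the run labels are made up for this example.

```python
def pct_gain(baseline: float, tuned: float) -> float:
    """Relative improvement of tuned over baseline, in percent."""
    return (tuned / baseline - 1.0) * 100.0


# (generation t/s, prompt t/s) as reported in the post
baseline = (51.2, 87.9)                      # --cpu-moe
runs = {
    "--n-cpu-moe 20": (78.7, 100.6),
    "--n-cpu-moe 20 -np 1, 128K": (79.3, 135.8),
}

for label, (gen, prompt) in runs.items():
    gen_gain = pct_gain(baseline[0], gen)
    prompt_gain = pct_gain(baseline[1], prompt)
    print(f"{label}: generation +{gen_gain:.1f}%, prompt +{prompt_gain:.1f}%")
```

Both generation figures land within about a point of the 54% the post cites, so the headline gain holds for either run.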

The comments added useful caution. Some users pointed to --fit on, --fit-ctx 128000, and --fit-target 512 as a simpler route for their own setups. That matters: this is one hardware and software configuration, not a universal benchmark. GPU generation, VRAM, quant, llama.cpp build, context length, and batching can all change the result.
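Because the best N depends on the machine, one way to test the knob is to generate a small sweep of commands and benchmark each by hand. The sketch below only builds command lines and runs nothing. The model filename, the 131072-token reading of "128K", and the range of N values are assumptions for illustration, so adjust them to your own setup.

```python
import shlex


def build_cmd(model_path: str, n_cpu_moe: int, ctx: int = 131072,
              binary: str = "llama-server") -> list[str]:
    """Build one llama.cpp server command using the flags from the post."""
    return [
        binary,
        "-m", model_path,               # path to the GGUF file (hypothetical here)
        "-c", str(ctx),                 # context size in tokens (assumed 128K = 131072)
        "-np", "1",                     # single parallel slot, as in the post
        "--n-cpu-moe", str(n_cpu_moe),  # first N layers' experts stay on CPU
    ]


MODEL = "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"  # hypothetical local filename

# Candidate values around the post's N=20; lower N needs more VRAM.
for n in range(16, 29, 4):
    print(shlex.join(build_cmd(MODEL, n)))
```

Starting from a high N and stepping down until VRAM is nearly full is one reasonable search order, since each step down moves more expert weights onto the GPU.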

Still, the post earned attention because it showed a knob that local users can test immediately. For local LLMs, usability is often decided less by the model card than by runtime placement, memory pressure, and a few flags that turn idle VRAM into throughput.

