A Qwen3.6 tuning post made --n-cpu-moe the LocalLLaMA knob of the day
Original: RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part.
An r/LocalLLaMA post put Qwen3.6-35B-A3B tuning into the form the community likes best: hardware, flags, and tokens per second. The author used an RTX 5070 Ti with 16GB VRAM, a Ryzen 9800X3D, 32GB DDR5, llama.cpp b8829, and unsloth/Qwen3.6-35B-A3B-GGUF at UD-Q4_K_M. The headline number was roughly 79 t/s with 128K context.
The finding centered on --cpu-moe versus --n-cpu-moe N. According to the post, the common --cpu-moe approach pushes all MoE experts to CPU and leaves much of the GPU underused, while --n-cpu-moe N keeps only the expert weights of the first N layers on CPU and lets the rest sit in VRAM. The --cpu-moe baseline was 51.2 generation t/s, 87.9 prompt t/s, and 3.5GB VRAM use. With --n-cpu-moe 20, the result rose to 78.7 generation t/s, 100.6 prompt t/s, and 12.7GB VRAM use.
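To make the two configurations concrete, here is a minimal Python sketch that assembles the corresponding llama-server command lines. The flags --cpu-moe, --n-cpu-moe, -c, and -np come from the post; the model filename, the -ngl 99 value, and reading "128K" as 131072 tokens are assumptions for illustration, not the author's confirmed settings.

```python
import shlex

# Hypothetical local filename; the post only names the repo and quant.
MODEL_PATH = "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"


def llama_server_args(model, n_cpu_moe=None, ctx=None, parallel=None):
    """Build a llama-server argument list.

    n_cpu_moe=None selects --cpu-moe (all MoE experts on CPU);
    an integer selects --n-cpu-moe N (experts of the first N layers on CPU).
    """
    # -ngl 99 (offload all layers to GPU) is an assumption, not from the post.
    args = ["llama-server", "-m", model, "-ngl", "99"]
    if n_cpu_moe is None:
        args.append("--cpu-moe")
    else:
        args += ["--n-cpu-moe", str(n_cpu_moe)]
    if ctx is not None:
        args += ["-c", str(ctx)]
    if parallel is not None:
        args += ["-np", str(parallel)]
    return args


baseline = llama_server_args(MODEL_PATH)
# 131072 is one reading of "128K context"; the post's exact value is not given here.
tuned = llama_server_args(MODEL_PATH, n_cpu_moe=20, ctx=131072, parallel=1)

print(shlex.join(baseline))
print(shlex.join(tuned))
```

The value 20 is what worked on the author's 16GB card; on other hardware N would likely need to go up or down until the model fits in VRAM.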
Adding -np 1 and 128K context produced 79.3 generation t/s, 135.8 prompt t/s, and 13.2GB VRAM use in the author’s run. The post summarized the gain as about 54% over the naive --cpu-moe path. That is why the thread became less about Qwen hype and more about how sparse MoE layers are placed across CPU and GPU.
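The "about 54%" figure can be checked directly from the reported numbers. The short sketch below recomputes the relative gains over the --cpu-moe baseline; the throughput values are the ones quoted in this article, and the helper name percent_gain is just illustrative.

```python
def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Figures as reported in the post (tokens per second).
BASELINE = {"gen": 51.2, "prompt": 87.9}          # --cpu-moe
N_CPU_MOE_20 = {"gen": 78.7, "prompt": 100.6}     # --n-cpu-moe 20
WITH_NP1_128K = {"gen": 79.3, "prompt": 135.8}    # plus -np 1 and 128K context

for label, run in [("--n-cpu-moe 20", N_CPU_MOE_20),
                   ("+ -np 1, 128K", WITH_NP1_128K)]:
    gen = percent_gain(run["gen"], BASELINE["gen"])
    prompt = percent_gain(run["prompt"], BASELINE["prompt"])
    print(f"{label}: generation +{gen:.1f}%, prompt +{prompt:.1f}%")
```

Generation speed comes out a little under 54% for --n-cpu-moe 20 alone and a little under 55% for the final configuration, which is consistent with the post's rounded summary.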
The comments added useful caution. Some users pointed to --fit on, --fit-ctx 128000, and --fit-target 512 as a simpler route for their own setups. That matters: this is one hardware and software configuration, not a universal benchmark. GPU generation, VRAM, quant, llama.cpp build, context length, and batching can all change the result.
Still, the post earned attention because it showed a knob that local users can test immediately. For local LLMs, usability is often decided less by the model card than by runtime placement, memory pressure, and a few flags that turn idle VRAM into throughput.
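For readers who want to run their own comparison, one way to frame the trade-off is "fastest configuration that still fits in VRAM." The sketch below applies that rule to the three measurements reported in the post; the Run type, the best_fitting helper, and the 1GB headroom are hypothetical choices for illustration, not part of the original thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    label: str
    gen_tps: float   # generation tokens per second
    vram_gb: float   # VRAM in use during the run


def best_fitting(runs, vram_total_gb, headroom_gb=1.0):
    """Return the fastest run whose VRAM use leaves `headroom_gb` free, or None."""
    budget = vram_total_gb - headroom_gb
    fitting = [r for r in runs if r.vram_gb <= budget]
    return max(fitting, key=lambda r: r.gen_tps) if fitting else None


# Measurements as reported in the post (RTX 5070 Ti, 16GB).
runs = [
    Run("--cpu-moe", 51.2, 3.5),
    Run("--n-cpu-moe 20", 78.7, 12.7),
    Run("--n-cpu-moe 20 -np 1, 128K context", 79.3, 13.2),
]

choice = best_fitting(runs, vram_total_gb=16.0)
print(choice.label if choice else "nothing fits")
```

On a smaller card the same rule would reject the higher-VRAM runs, which is why the right N depends on the hardware; the measurements themselves would also differ there and need to be collected again.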