A Qwen3.6 tuning post made --n-cpu-moe the LocalLLaMA knob of the day
Original: RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context; the --n-cpu-moe flag is the most important part.
A r/LocalLLaMA post put Qwen3.6-35B-A3B tuning into the form the community likes best: hardware, flags, and tokens per second. The author used an RTX 5070 Ti with 16GB VRAM, a Ryzen 9800X3D, 32GB DDR5, llama.cpp b8829, and unsloth/Qwen3.6-35B-A3B-GGUF at UD-Q4_K_M. The headline number was roughly 79 t/s with 128K context.
The finding centered on --cpu-moe versus --n-cpu-moe N. The common --cpu-moe approach keeps every MoE expert tensor on the CPU, which leaves much of the GPU idle; --n-cpu-moe N instead keeps only the expert tensors of the first N layers on CPU and loads the rest into VRAM. According to the post, the --cpu-moe baseline was 51.2 generation t/s, 87.9 prompt t/s, and 3.5GB of VRAM. With --n-cpu-moe 20, throughput rose to 78.7 generation t/s and 100.6 prompt t/s at 12.7GB of VRAM.
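The two invocations can be sketched as below. The model path and the -ngl value are illustrative assumptions, not taken from the post; only the MoE flags and the numbers in the comments come from it.

```shell
# Baseline from the post: all MoE expert tensors on CPU (~51 t/s, 3.5GB VRAM).
./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --cpu-moe

# Tuned: only the first 20 layers' expert tensors stay on CPU (~79 t/s, 12.7GB VRAM).
./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
    --n-gpu-layers 99 \
    --n-cpu-moe 20
```

The point of the second form is that N becomes a dial: lower it until VRAM is nearly full, and the expert tensors that fit on the GPU stop paying the PCIe transfer cost.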
Adding -np 1 (a single parallel slot) and a 128K context produced 79.3 generation t/s, 135.8 prompt t/s, and 13.2GB VRAM use in the author's run. The post summarized the gain as about 54% over the naive --cpu-moe path. That is why the thread became less about Qwen hype and more about how sparse MoE layers are placed across CPU and GPU.
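The ~54% figure follows directly from the post's own generation numbers, which a one-liner can verify:

```shell
# Speedup of --n-cpu-moe 20 (78.7 t/s) over the --cpu-moe baseline (51.2 t/s).
awk 'BEGIN { base=51.2; tuned=78.7; printf "%.1f%%\n", (tuned-base)/base*100 }'
# prints "53.7%"
```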
The comments added useful caution. Some users pointed to --fit on, --fit-ctx 128000, and --fit-target 512 as a simpler route for their own setups. That matters: this is one hardware and software configuration, not a universal benchmark. GPU generation, VRAM, quant, llama.cpp build, context length, and batching can all change the result.
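For completeness, the alternative route from the comments looks like this. The flags are quoted as-is from the thread; whether they exist depends on the llama.cpp build, and the model path is again an illustrative assumption.

```shell
# Commenters' simpler route: let the runtime size the CPU/GPU split itself,
# given a target context and a throughput target.
./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
    --fit on --fit-ctx 128000 --fit-target 512
```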
Still, the post earned attention because it showed a knob that local users can test immediately. For local LLMs, usability is often decided less by the model card than by runtime placement, memory pressure, and a few flags that turn idle VRAM into throughput.
Related Articles
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.
LocalLLaMA reacted because the half-joking idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, feeding llama-server help output into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 40.05 tok/s.
The LocalLLaMA thread cared less about a release headline and more about which Qwen3.6 GGUF quant actually works. Unsloth’s benchmark post pushed the discussion into KLD, disk size, CUDA 13.2 failures, and the messy details that decide local inference quality.