LocalLLaMA Finds a Practical Speed Trick in Caching Hot MoE Experts in VRAM
Original post: “Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload”
The “Hot Experts” LocalLLaMA post hit a concrete bottleneck: running a large MoE model when the GPU has enough VRAM for part of the workload, but not all of it. The author tested Qwen3.5-122B-A10B on an RTX 4090 (24GB), a Ryzen 9 7950X, and 96GB of RAM, and described the all-CPU expert baseline of roughly 15 tok/s as noticeably sluggish for streaming responses.
The proposed fix is a dynamic expert cache. Over the past N tokens, the runtime tracks which experts are routed to most often. It keeps those “hot” experts in VRAM and leaves colder expert tensors in system RAM, then rebalances periodically. The bet is that the gain from running frequently used experts on the GPU outweighs the cost of moving tensors between system RAM and VRAM. The code is published as a llama.cpp fork.
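The mechanism can be sketched in a few dozen lines. This is a minimal, hypothetical illustration of the idea described above, not the fork's actual llama.cpp code: the class name, parameter names, and data structures are assumptions; the real implementation works on GGML tensors and does the actual RAM-to-VRAM copies.

```python
from collections import Counter, deque

class HotExpertCache:
    """Sketch of a dynamic expert cache (hypothetical names, not the
    fork's real API): track expert routing over the last `window`
    tokens and keep the top `slots` experts resident "in VRAM"."""

    def __init__(self, slots=44, window=512, rebalance_interval=60):
        self.slots = slots                   # expert tensors that fit in VRAM
        self.rebalance_interval = rebalance_interval
        self.history = deque(maxlen=window)  # per-token lists of routed experts
        self.counts = Counter()              # rolling routing frequencies
        self.in_vram = set()                 # experts currently resident on GPU
        self.tokens_seen = 0

    def record_token(self, routed_experts):
        """Call once per generated token with the experts the router chose."""
        if len(self.history) == self.history.maxlen:
            # The oldest token is about to fall out of the window;
            # remove its votes before the deque drops it.
            self.counts.subtract(self.history[0])
        self.history.append(list(routed_experts))
        self.counts.update(routed_experts)
        self.tokens_seen += 1
        if self.tokens_seen % self.rebalance_interval == 0:
            self.rebalance()

    def rebalance(self):
        """Promote the hottest experts to VRAM; the rest stay in system RAM."""
        hottest = {e for e, _ in self.counts.most_common(self.slots)}
        to_load = hottest - self.in_vram
        to_evict = self.in_vram - hottest
        # A real implementation would copy the tensors here (RAM <-> VRAM);
        # the rebalance interval amortizes that transfer cost.
        self.in_vram = hottest
        return to_load, to_evict
```

The rebalance interval is the key knob: rebalancing every token would thrash the PCIe bus, while rebalancing too rarely lets the resident set drift away from what the router is actually using.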
The reported numbers are specific enough to make the thread useful. With all experts on CPU, token generation averaged about 15.65 tok/s. A layer-based offload using 22.6GB VRAM reached about 17.87 tok/s. The hot expert cache, using 22.2GB VRAM with 44 expert slots and a rebalance interval of 60, produced generation runs of 22.26, 22.97, and 22.77 tok/s. The author summarized that as 44.8% faster than the all-CPU expert baseline and 26.8% faster than layer-based offload at a similar VRAM commitment.
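The percentages are easy to check against the reported runs; this short snippet just redoes the author's arithmetic from the figures quoted above:

```python
# Reported generation speeds (tok/s) from the post
cpu_baseline = 15.65                    # all experts on CPU
layer_offload = 17.87                   # layer-based offload, 22.6GB VRAM
hot_cache_runs = [22.26, 22.97, 22.77]  # hot expert cache, 22.2GB VRAM

hot_avg = sum(hot_cache_runs) / len(hot_cache_runs)

vs_cpu = (hot_avg / cpu_baseline - 1) * 100    # ≈ 44.8%
vs_layer = (hot_avg / layer_offload - 1) * 100  # ≈ 26.8%
print(f"{hot_avg:.2f} tok/s: +{vs_cpu:.1f}% vs CPU, +{vs_layer:.1f}% vs layer offload")
```

Both headline figures fall out of the average of the three cache runs (about 22.67 tok/s), so the summary numbers are internally consistent.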
The comments quickly shifted from applause to methodology questions. One commenter pointed to llama-server's fit options and warned about extra graph splits from non-consecutive layer placement. Others asked about MoE-specific flags, static expert selection based on imatrix data, and whether this overlaps with projects such as PowerInfer. Another wanted latency broken out by prompt processing versus generation, because some optimizations look strong until prompt-heavy workloads dominate.
The broader point is that MoE inference makes memory placement a first-class tuning problem. Since not every expert is active for every token, deciding which experts deserve the fast memory can matter as much as deciding how many layers to offload. On a non-unified-memory PC, PCIe movement and system RAM access are real taxes. Whether this fork becomes an upstream feature is uncertain, but the community signal is clear: local LLM optimization is moving deeper into the memory hierarchy.
Related Articles
r/LocalLLaMA reacted because the half-joking idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, feeding llama-server help output into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 40.05 tok/s.
r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.
A March 2026 r/LocalLLaMA post with 126 points and 45 comments highlighted a practical guide for running Qwen3.5-27B through llama.cpp and wiring it into OpenCode. The post stands out because it covers the operational details that usually break local coding setups: quant choice, chat-template fixes, VRAM budgeting, Tailscale networking, and tool-calling behavior.