LocalLLaMA Finds a Practical Speed Trick in Caching Hot MoE Experts in VRAM
Original post: “Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload”
The “Hot Experts” LocalLLaMA post hit a concrete bottleneck: running a large MoE model when the GPU has enough VRAM for part of the workload, but not all of it. The author tested Qwen3.5-122B-A10B on an RTX 4090 (24GB), a Ryzen 9 7950X, and 96GB of RAM, and described the all-CPU expert baseline of roughly 15 tok/s as noticeably sluggish for streaming responses.
The proposed fix is a dynamic expert cache. Over the past N tokens, the runtime tracks which experts are routed to most often. It keeps those “hot” experts in VRAM and leaves colder expert tensors in system RAM, then rebalances periodically. The bet is that the gain from running frequently used experts on the GPU outweighs the cost of moving tensors between system RAM and VRAM. The code is published as a llama.cpp fork.
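The mechanism can be sketched in a few dozen lines. This is a minimal, hypothetical illustration of the idea described above, not the fork's actual llama.cpp code: the class name, parameter names, and data structures are assumptions; the real implementation works on GGML tensors and does the actual RAM-to-VRAM copies.

```python
from collections import Counter, deque

class HotExpertCache:
    """Sketch of a dynamic expert cache (hypothetical names, not the
    fork's real API): track expert routing over the last `window`
    tokens and keep the top `slots` experts resident "in VRAM"."""

    def __init__(self, slots=44, window=512, rebalance_interval=60):
        self.slots = slots                   # expert tensors that fit in VRAM
        self.rebalance_interval = rebalance_interval
        self.history = deque(maxlen=window)  # per-token lists of routed experts
        self.counts = Counter()              # rolling routing frequencies
        self.in_vram = set()                 # experts currently resident on GPU
        self.tokens_seen = 0

    def record_token(self, routed_experts):
        """Call once per generated token with the experts the router chose."""
        if len(self.history) == self.history.maxlen:
            # The oldest token is about to fall out of the window;
            # remove its votes before the deque drops it.
            self.counts.subtract(self.history[0])
        self.history.append(list(routed_experts))
        self.counts.update(routed_experts)
        self.tokens_seen += 1
        if self.tokens_seen % self.rebalance_interval == 0:
            self.rebalance()

    def rebalance(self):
        """Promote the hottest experts to VRAM; the rest stay in system RAM."""
        hottest = {e for e, _ in self.counts.most_common(self.slots)}
        to_load = hottest - self.in_vram
        to_evict = self.in_vram - hottest
        # A real implementation would copy the tensors here (RAM <-> VRAM);
        # the rebalance interval amortizes that transfer cost.
        self.in_vram = hottest
        return to_load, to_evict
```

The rebalance interval is the key knob: rebalancing every token would thrash the PCIe bus, while rebalancing too rarely lets the resident set drift away from what the router is actually using.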
The reported numbers are specific enough to make the thread useful. With all experts on CPU, token generation averaged about 15.65 tok/s. A layer-based offload using 22.6GB VRAM reached about 17.87 tok/s. The hot expert cache, using 22.2GB VRAM with 44 expert slots and a rebalance interval of 60, produced generation runs of 22.26, 22.97, and 22.77 tok/s. The author summarized that as 44.8% faster than the all-CPU expert baseline and 26.8% faster than layer-based offload at a similar VRAM commitment.
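The percentages are easy to check against the reported runs; this short snippet just redoes the author's arithmetic from the figures quoted above:

```python
# Reported generation speeds (tok/s) from the post
cpu_baseline = 15.65                    # all experts on CPU
layer_offload = 17.87                   # layer-based offload, 22.6GB VRAM
hot_cache_runs = [22.26, 22.97, 22.77]  # hot expert cache, 22.2GB VRAM

hot_avg = sum(hot_cache_runs) / len(hot_cache_runs)

vs_cpu = (hot_avg / cpu_baseline - 1) * 100    # ≈ 44.8%
vs_layer = (hot_avg / layer_offload - 1) * 100  # ≈ 26.8%
print(f"{hot_avg:.2f} tok/s: +{vs_cpu:.1f}% vs CPU, +{vs_layer:.1f}% vs layer offload")
```

Both headline figures fall out of the average of the three cache runs (about 22.67 tok/s), so the summary numbers are internally consistent.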
The comments quickly shifted from applause to methodology questions. One commenter pointed to llama-server's fit options and warned about extra graph splits from non-consecutive layer placement. Others asked about MoE-specific flags, static expert selection based on imatrix data, and whether this overlaps with projects such as PowerInfer. Another wanted latency broken out by prompt processing versus generation, because some optimizations look strong until prompt-heavy workloads dominate.
The broader point is that MoE inference makes memory placement a first-class tuning problem. Since not every expert is active for every token, deciding which experts deserve the fast memory can matter as much as deciding how many layers to offload. On a non-unified-memory PC, PCIe movement and system RAM access are real taxes. Whether this fork becomes an upstream feature is uncertain, but the community signal is clear: local LLM optimization is moving deeper into the memory hierarchy.
Related Articles
r/LocalLLaMA reacted because the half-joking idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, feeding llama-server help output into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 40.05 tok/s.
r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.
A March 2026 r/LocalLLaMA post with 126 points and 45 comments highlighted a practical guide for running Qwen3.5-27B through llama.cpp and wiring it into OpenCode. The post stands out because it covers the operational details that usually break local coding setups: quant choice, chat-template fixes, VRAM budgeting, Tailscale networking, and tool-calling behavior.