llama.cpp --fit made LocalLLaMA rethink the VRAM wall
Original: Llama.cpp's auto fit works much better than I expected
An r/LocalLLaMA post about llama.cpp’s --fit option landed because it was practical, not theoretical. The poster said they had assumed 32GB of VRAM limited them to models around 20GB if they wanted usable speed. Then they tested Qwen3.6 Q8 at a 256k context, with weights larger than VRAM, on a 5090 connected over Oculink, and reported 57 tokens per second. The number mattered less than the shift in intuition: local inference may not be as binary as “fits in VRAM” versus “2 t/s pain.”
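As a rough illustration, a run in the spirit of the post might look like the sketch below. The model filename and context size are placeholders, -m and -c are standard llama.cpp flags, and --fit is the auto-placement option the thread is about; its exact syntax may differ across builds.

```sh
# Hedged sketch: let llama.cpp place weights and cache automatically
# instead of hand-picking -ngl. Paths and sizes are placeholders.
./llama-server \
  -m ./Qwen3.6-35B-Q8_0.gguf \
  -c 262144 \
  --fit
```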
The comment section immediately turned into tuning notes. One commenter suggested Q8_0 KV-cache quantization, saying it might fit more of the 256k context into VRAM and double throughput. Another pointed out that Qwen3.6 35B is an MoE architecture with roughly 3B active parameters, so dense models like a 27B checkpoint may behave differently. A third user reported a jump from 12 t/s to 48 t/s on a Qwen3.6 35B quant after trying the same idea.
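A hedged sketch of that suggestion, layered onto the run above: --cache-type-k and --cache-type-v are existing llama.cpp options, and on some builds a quantized V cache also requires flash attention to be enabled.

```sh
# Q8_0 KV cache roughly halves cache memory versus f16, so more of
# the 256k context can stay in VRAM. Model path is a placeholder.
./llama-server \
  -m ./Qwen3.6-35B-Q8_0.gguf \
  -c 262144 \
  --fit \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```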
- --fit may reduce the time users spend hand-tuning tensor splits for each model.
- KV-cache format, fit target, quantization, and interconnect still shape the final result.
- The MoE versus dense distinction is essential before generalizing the result (see the sketch after this list).
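The MoE point in concrete terms: with roughly 3B of 35B parameters active per token, the expert FFN tensors are cold enough to sit in system RAM, which is what manual offload recipes exploit and what automatic placement approximates. A minimal sketch using llama.cpp's tensor-override flag, assuming the usual *_exps naming for expert tensors; a dense 27B has no comparable cold subset, so spilled weights slow every token.

```sh
# Manual version of the trick --fit automates for MoE models: keep
# attention and shared weights on GPU, route expert tensors to CPU.
# The "exps" regex matches expert FFN tensor names in common GGUFs.
./llama-server \
  -m ./Qwen3.6-35B-Q8_0.gguf \
  -ngl 99 \
  --override-tensor "exps=CPU"
```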
Community discussion also kept the caveat alive: automatic placement does not win in every multi-GPU or multi-machine case, and some users still get better results by splitting models manually. That is what made the thread useful. It was not a miracle claim. It was a reminder that local LLM performance now depends on runtime placement and memory behavior as much as raw VRAM size. For hobbyists and small labs, that makes old capacity assumptions worth retesting.
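For the manual-split camp, the escape hatch is llama.cpp's explicit placement flags. A hedged two-GPU example with made-up proportions:

```sh
# Hypothetical 24GB + 16GB box: split layers 60/40 by hand instead
# of trusting automatic placement. -sm picks the split strategy,
# -ts the per-GPU proportions.
./llama-server \
  -m ./model-Q8_0.gguf \
  -ngl 99 \
  -sm layer \
  -ts 60,40
```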
The practical takeaway is narrow but useful: users who dismissed larger local models because the weights exceeded VRAM may need to retest with current llama.cpp builds, current cache quantization, and the actual context length they plan to use. The outcome will depend on hardware, but the old spreadsheet answer may be stale.
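One low-effort way to run that retest is llama.cpp's bundled llama-bench, which reports prompt-processing and generation throughput separately. Sizes below are placeholders; match them to the actual workload.

```sh
# Re-measure instead of trusting the old spreadsheet: -p is prompt
# tokens to process, -n is tokens to generate.
./llama-bench \
  -m ./model-Q8_0.gguf \
  -p 8192 \
  -n 256
```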
The original thread is on r/LocalLLaMA.
Related Articles
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.
LocalLLaMA reacted because the joke-like idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, feeding llama-server help into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 40.05 tok/s.
r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.