llama.cpp --fit made LocalLLaMA rethink the VRAM wall
Original: Llama.cpp's auto fit works much better than I expected
An r/LocalLLaMA post about llama.cpp’s --fit option landed because it was practical, not theoretical. The poster said they had assumed that 32GB of VRAM limited them to models of around 20GB if they wanted usable speed. Then they tested Qwen3.6 Q8 with a 256k context, with weights larger than VRAM, on a 5090 connected over OCuLink, and reported 57 tokens per second. The number mattered less than the shift in intuition: local inference may not be as binary as “fits in VRAM” versus “2 t/s pain.”
The comment section immediately turned into tuning notes. One commenter suggested Q8_0 KV-cache quantization, saying it might fit more of the 256k context into VRAM and double throughput. Another pointed out that Qwen3.6 35B is an MoE architecture with roughly 3B active parameters, so dense models like a 27B checkpoint may behave differently. A third user reported a jump from 12 t/s to 48 t/s on a Qwen3.6 35B quant after trying the same idea.
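To see why the cache format matters at a 256k context, a rough size estimate helps. The sketch below assumes a plain attention stack in which every layer stores K and V for every token. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen3.6's real values, and the real architecture may differ (for example, not every layer necessarily keeps a per-token cache). The only format fact it relies on is that ggml's q8_0 stores 32 values plus one fp16 scale in a 34-byte block.

```python
# Rough KV-cache size estimate for a standard attention stack.
# All model dimensions used in the demo are hypothetical placeholders,
# not values taken from Qwen3.6 or any other real checkpoint.

# Bytes per cached element for two ggml cache types.
# q8_0 packs 32 int8 values plus one fp16 scale into a 34-byte block.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Return the bytes needed to hold K and V for n_ctx tokens."""
    per_token_elements = 2 * n_layers * n_kv_heads * head_dim  # K and V
    return n_ctx * per_token_elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to make the arithmetic concrete.
    dims = dict(n_ctx=262_144, n_layers=48, n_kv_heads=4, head_dim=128)
    for cache_type in ("f16", "q8_0"):
        gib = kv_cache_bytes(cache_type=cache_type, **dims) / 2**30
        print(f"{cache_type}: {gib:.2f} GiB")
```

With these placeholder dimensions the cache shrinks from 24 GiB at f16 to 12.75 GiB at q8_0. Whether that translates into the throughput gain the commenter predicted depends on how much of the cache ends up in VRAM on a given setup.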
- --fit may reduce the time users spend hand-tuning tensor splits for each model.
- KV-cache format, fit target, quantization, and interconnect still shape the final result.
- The MoE versus dense distinction is essential before generalizing the result.
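The MoE point can be made concrete with a back-of-envelope bound. The sketch below assumes token generation is limited purely by how fast the active weights can be read from a single memory pool, which ignores KV-cache reads, compute, and interconnect transfers. The 80 GB/s bandwidth is a hypothetical figure, not a measurement from the thread, and the results are ceilings, not predictions.

```python
# Back-of-envelope decode ceiling for a memory-bound model:
#   tokens/s <= memory bandwidth / bytes of weights read per token.
# Ignores KV-cache reads, compute time, and interconnect overhead.

Q8_0_BITS_PER_WEIGHT = 8.5  # 34-byte block per 32 weights


def weight_bytes_per_token(active_params, bits_per_weight=Q8_0_BITS_PER_WEIGHT):
    """Bytes of weights touched to generate one token."""
    return active_params * bits_per_weight / 8


def decode_ceiling_tps(active_params, bandwidth_bytes_per_s,
                       bits_per_weight=Q8_0_BITS_PER_WEIGHT):
    """Upper bound on tokens per second under the assumptions above."""
    return bandwidth_bytes_per_s / weight_bytes_per_token(active_params, bits_per_weight)


if __name__ == "__main__":
    hypothetical_bandwidth = 80e9  # assumed 80 GB/s, illustrative only
    for name, active in (("MoE, ~3B active", 3e9), ("dense 27B", 27e9)):
        tps = decode_ceiling_tps(active, hypothetical_bandwidth)
        print(f"{name}: at most {tps:.1f} tokens/s")
```

At the same bandwidth and quantization, the ceiling for roughly 3B active parameters is nine times higher than for a dense 27B model. That is one way to see why a result on an MoE checkpoint may not carry over to a dense one.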
Community discussion also kept the caveat alive: automatic placement does not win in every multi-GPU or multi-machine case, and some users still get better results by splitting models manually. That is what made the thread useful. It was not a miracle claim. It was a reminder that local LLM performance now depends on runtime placement and memory behavior as much as raw VRAM size. For hobbyists and small labs, that makes old capacity assumptions worth retesting.
The practical takeaway is narrow but useful: users who dismissed larger local models because the weights exceeded VRAM may need to retest with current llama.cpp builds, current cache quantization, and the actual context length they plan to use. The outcome will depend on hardware, but the old spreadsheet answer may be stale.
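One way to organize such a retest is to generate a small matrix of runs over context length and cache type. The sketch below only builds the command lines and does not execute anything. The model path is hypothetical, and the flag spellings (-m, -c, --cache-type-k, --cache-type-v, and --fit on) are assumptions about a recent build that should be checked against the binary's --help output before use. A quantized V cache may also require flash attention to be enabled, depending on the build.

```python
from itertools import product


def build_retest_commands(model_path, ctx_sizes, cache_types, binary="llama-server"):
    """Return one argv list per (context size, cache type) combination.

    Flag names are assumptions about a recent llama.cpp build;
    confirm them with `<binary> --help` before running anything.
    """
    commands = []
    for ctx, cache in product(ctx_sizes, cache_types):
        commands.append([
            binary,
            "-m", model_path,
            "-c", str(ctx),
            "--cache-type-k", cache,
            "--cache-type-v", cache,
            "--fit", "on",  # assumed spelling of the auto-fit switch
        ])
    return commands


if __name__ == "__main__":
    # "model.gguf" is a hypothetical path used only for illustration.
    for argv in build_retest_commands("model.gguf", [32_768, 262_144], ["f16", "q8_0"]):
        print(" ".join(argv))
```

Testing at the context length that will actually be used matters because the cache grows with context, so a configuration that fits at 32k may be placed very differently at 256k.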
The original thread is on r/LocalLLaMA.