llama.cpp --fit made LocalLLaMA rethink the VRAM wall
Original: Llama.cpp's auto fit works much better than I expected
An r/LocalLLaMA post about llama.cpp’s --fit option landed because it was practical, not theoretical. The poster said they had assumed 32GB of VRAM limited them to models around 20GB if they wanted usable speed. Then they tested Qwen3.6 Q8 at a 256k context, with weights larger than VRAM, on a 5090 connected over Oculink, and reported 57 tokens per second. The number mattered less than the shift in intuition: local inference may not be as binary as “fits in VRAM” versus “2 t/s pain.”
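As a rough illustration, a run in the spirit of the post might look like the sketch below. The model filename and context size are placeholders, -m and -c are standard llama.cpp flags, and --fit is the auto-placement option the thread is about; its exact syntax may differ across builds.

```sh
# Hedged sketch: let llama.cpp place weights and cache automatically
# instead of hand-picking -ngl. Paths and sizes are placeholders.
./llama-server \
  -m ./Qwen3.6-35B-Q8_0.gguf \
  -c 262144 \
  --fit
```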
The comment section immediately turned into tuning notes. One commenter suggested Q8_0 KV-cache quantization, saying it might fit more of the 256k context into VRAM and double throughput. Another pointed out that Qwen3.6 35B is an MoE architecture with roughly 3B active parameters, so dense models like a 27B checkpoint may behave differently. A third user reported a jump from 12 t/s to 48 t/s on a Qwen3.6 35B quant after trying the same idea.
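A hedged sketch of that suggestion, layered onto the run above: --cache-type-k and --cache-type-v are existing llama.cpp options, and on some builds a quantized V cache also requires flash attention to be enabled.

```sh
# Q8_0 KV cache roughly halves cache memory versus f16, so more of
# the 256k context can stay in VRAM. Model path is a placeholder.
./llama-server \
  -m ./Qwen3.6-35B-Q8_0.gguf \
  -c 262144 \
  --fit \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```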
- --fit may reduce the time users spend hand-tuning tensor splits for each model.
- KV-cache format, fit target, quantization, and interconnect still shape the final result.
- The MoE versus dense distinction is essential before generalizing the result (see the sketch after this list).
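The MoE point in concrete terms: with roughly 3B of 35B parameters active per token, the expert FFN tensors are cold enough to sit in system RAM, which is what manual offload recipes exploit and what automatic placement approximates. A minimal sketch using llama.cpp's tensor-override flag, assuming the usual *_exps naming for expert tensors; a dense 27B has no comparable cold subset, so spilled weights slow every token.

```sh
# Manual version of the trick --fit automates for MoE models: keep
# attention and shared weights on GPU, route expert tensors to CPU.
# The "exps" regex matches expert FFN tensor names in common GGUFs.
./llama-server \
  -m ./Qwen3.6-35B-Q8_0.gguf \
  -ngl 99 \
  --override-tensor "exps=CPU"
```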
Community discussion also kept the caveat alive: automatic placement does not win in every multi-GPU or multi-machine case, and some users still get better results by splitting models manually. That is what made the thread useful. It was not a miracle claim. It was a reminder that local LLM performance now depends on runtime placement and memory behavior as much as raw VRAM size. For hobbyists and small labs, that makes old capacity assumptions worth retesting.
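For the manual-split camp, the escape hatch is llama.cpp's explicit placement flags. A hedged two-GPU example with made-up proportions:

```sh
# Hypothetical 24GB + 16GB box: split layers 60/40 by hand instead
# of trusting automatic placement. -sm picks the split strategy,
# -ts the per-GPU proportions.
./llama-server \
  -m ./model-Q8_0.gguf \
  -ngl 99 \
  -sm layer \
  -ts 60,40
```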
The practical takeaway is narrow but useful: users who dismissed larger local models because the weights exceeded VRAM may need to retest with current llama.cpp builds, current cache quantization, and the actual context length they plan to use. The outcome will depend on hardware, but the old spreadsheet answer may be stale.
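One low-effort way to run that retest is llama.cpp's bundled llama-bench, which reports prompt-processing and generation throughput separately. Sizes below are placeholders; match them to the actual workload.

```sh
# Re-measure instead of trusting the old spreadsheet: -p is prompt
# tokens to process, -n is tokens to generate.
./llama-bench \
  -m ./model-Q8_0.gguf \
  -p 8192 \
  -n 256
```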
The original thread is on r/LocalLLaMA.
Related Articles
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.
LocalLLaMA reacted because the joke-like idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, feeding llama-server help into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 40.05 tok/s.
r/LocalLLaMA cared because the numbers were concrete: 79 t/s on an RTX 5070 Ti with 128K context, tied to one llama.cpp flag choice.