llama.cpp --fit made LocalLLaMA rethink the VRAM wall

Original: "Llama.cpp's auto fit works much better than I expected"

LLM · Apr 22, 2026 · By Insights AI (Reddit) · 2 min read

A r/LocalLLaMA post about llama.cpp’s --fit option landed because it was practical, not theoretical. The poster said they had assumed 32GB of VRAM limited them to models around 20GB if they wanted usable speed. Then they ran Qwen3.6 at Q8 with a 256k context, weights larger than VRAM, on a 5090 connected over Oculink, and reported 57 tokens per second. The number mattered less than the shift in intuition: local inference may not be as binary as “fits in VRAM” versus “2 t/s pain.”

The comment section immediately turned into tuning notes. One commenter suggested Q8_0 KV-cache quantization, saying it might fit more of the 256k context into VRAM and double throughput. Another pointed out that Qwen3.6 35B is an MoE architecture with roughly 3B active parameters, so dense models like a 27B checkpoint may behave differently. A third user reported a jump from 12 t/s to 48 t/s on a Qwen3.6 35B quant after trying the same idea.
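The KV-cache suggestion is easy to sanity-check with arithmetic. A rough sketch, assuming a grouped-query-attention layout (the layer count, KV-head count, and head dimension below are illustrative placeholders, not Qwen3.6's published config) and q8_0 at roughly 1.0625 bytes per element (one byte per value plus a per-32-block scale):

```python
# Back-of-envelope KV-cache size: 2 tensors (K and V) per layer,
# each holding n_kv_heads * head_dim values per token of context.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    return int(2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem)

CTX = 256 * 1024  # the 256k context from the post

# Illustrative GQA config (NOT the model's actual numbers).
f16  = kv_cache_bytes(48, 4, 128, CTX, 2.0)     # f16: 2 bytes/elem
q8_0 = kv_cache_bytes(48, 4, 128, CTX, 1.0625)  # q8_0: ~34 bytes per 32 elems

print(f"f16  KV cache: {f16 / 2**30:.1f} GiB")   # 24.0 GiB
print(f"q8_0 KV cache: {q8_0 / 2**30:.1f} GiB")  # 12.8 GiB
```

Even with made-up layer counts, the shape of the result holds: at 256k context the cache is tens of gigabytes, so moving it from f16 to q8_0 frees VRAM on the order of the weights themselves, which is why the commenter saw it as a throughput lever.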

  • --fit may reduce the time users spend hand-tuning tensor splits for each model.
  • KV-cache format, fit target, quantization, and interconnect still shape the final result.
  • The MoE versus dense distinction is essential before generalizing the result.

Community discussion also kept the caveat alive: automatic placement does not win in every multi-GPU or multi-machine case, and some users still get better results by splitting models manually. That is what made the thread useful. It was not a miracle claim. It was a reminder that local LLM performance now depends on runtime placement and memory behavior as much as raw VRAM size. For hobbyists and small labs, that makes old capacity assumptions worth retesting.
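The automatic-versus-manual debate is ultimately a placement problem: given per-device free memory, decide which layers live where. A minimal sketch of the greedy idea, as an illustration of the concept only, not llama.cpp's actual --fit algorithm:

```python
# Greedy first-fit placement: walk the layers in order and assign each to
# the first device with enough free bytes, spilling the rest to host RAM.
def place_layers(layer_bytes, device_free_bytes):
    free = list(device_free_bytes)
    placement = []
    for size in layer_bytes:
        for dev, avail in enumerate(free):
            if size <= avail:
                free[dev] -= size
                placement.append(dev)
                break
        else:
            placement.append("cpu")  # no GPU had room: offload to host
    return placement

# Four 1 GiB layers; one GPU with 2.5 GiB free, one with 1 GiB free.
GiB = 2**30
print(place_layers([GiB] * 4, [int(2.5 * GiB), GiB]))  # [0, 0, 1, 'cpu']
```

A sketch like this also shows why manual splits can still win: a greedy pass models only capacity, not interconnect speed or per-layer compute, so in multi-GPU or multi-machine setups a hand-tuned split that accounts for the slow link can beat the automatic one.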

The practical takeaway is narrow but useful: users who dismissed larger local models because the weights exceeded VRAM may need to retest with current llama.cpp builds, current cache quantization, and the actual context length they plan to use. The outcome will depend on hardware, but the old spreadsheet answer may be stale.
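Retesting is cheap to script. A minimal timing sketch, where `generate` is a stand-in for whatever inference call you use (not a real llama.cpp binding):

```python
import time

# Time decode at the context length you actually plan to use and
# report tokens per second. `generate` is any callable that produces
# n_tokens tokens -- a hypothetical stand-in, not a llama.cpp API.
def tokens_per_sec(generate, n_tokens):
    start = time.perf_counter()
    generate(n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stub generator so the sketch runs standalone; swap in a real call.
rate = tokens_per_sec(lambda n: time.sleep(0.05), 100)
print(f"{rate:.0f} t/s")
```

The point is only to measure under the real configuration — the actual quant, cache type, and context — rather than trusting a capacity spreadsheet.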

The original thread is on r/LocalLLaMA.

