Qwen 3.5 local guide maps out memory budgets, 256K context, and llama.cpp setup
Original: How to run Qwen 3.5 locally
The Hacker News submission "How to run Qwen 3.5 locally" is less interesting as a headline than as an operations document. The linked Unsloth guide focuses on how to run the Qwen3.5 family on real local hardware, covering 35B-A3B, 27B, 122B-A10B, 397B-A17B, and the smaller 0.8B, 2B, 4B, and 9B models in one place.
The most practical part of the guide is its hardware table, which lists approximate 4-bit memory requirements per model:

- 27B: 17 GB
- 35B-A3B: 22 GB
- 122B-A10B: 70 GB
- 397B-A17B: 214 GB

The document also says the family supports 256K context across 201 languages and frames the 27B and 35B-A3B models as realistic local options for roughly 22 GB-class unified-memory devices. Its guidance is straightforward: pick 27B if you want somewhat higher accuracy and are memory constrained, or 35B-A3B if you want faster inference.
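As a rough sanity check on those numbers (an estimate, not a figure from the guide): a 4-bit quant stores roughly 0.5 bytes per parameter, so 27B parameters come to about 13.5 GB of weights; the gap up to the quoted 17 GB plausibly covers higher-precision layers in the Dynamic quants, the KV cache, and runtime overhead.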
What the guide actually gives developers
- Per-model memory budgets and quantization choices
- Recommended temperature, top-p, and top-k settings for thinking and non-thinking modes
- Explicit reasoning control through --chat-template-kwargs '{"enable_thinking":false}' (see the sketch after this list)
- llama.cpp build instructions and ready-to-run llama-cli commands
- Operational notes about refreshed GGUFs, quantization changes, and tool-calling fixes
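To make those list items concrete, here is a minimal launch sketch. Only the --chat-template-kwargs flag is quoted from the guide; the model filename, context size, and sampler values are illustrative assumptions, not the guide's published numbers.

```bash
# Minimal llama-cli sketch: non-thinking mode with explicit sampler settings.
# Filename and sampler values are illustrative assumptions; only the
# --chat-template-kwargs flag is taken from the guide itself.
./llama.cpp/build/bin/llama-cli \
  -m ./models/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --jinja \
  --chat-template-kwargs '{"enable_thinking":false}' \
  --temp 0.7 --top-p 0.8 --top-k 20 \
  -c 32768 \
  -p "Summarize the trade-offs between 27B and 35B-A3B for local use."
```

Dropping the kwargs flag, or setting enable_thinking to true, re-enables the model's reasoning traces, which is the switch the guide treats as a deployment decision rather than a fixed model property.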
That emphasis matters. Instead of treating Qwen3.5 as only a benchmark story, the page acts like a deployment cookbook. It walks through cloning and building llama.cpp, downloading GGUFs from Hugging Face, and launching Dynamic 4-bit variants for different use cases. The March 5 update note is also important: Unsloth says users should redownload several Qwen3.5 GGUFs because improved quantization, new imatrix data, and a chat-template fix affect chat, coding, long-context, and tool-calling performance.
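The cookbook flow that paragraph describes compresses into a few commands. The repository and quant filenames below are assumptions for illustration; the build invocation follows llama.cpp's standard CMake workflow rather than anything quoted from the guide.

```bash
# Build llama.cpp (standard CMake flow; add -DGGML_CUDA=ON only if you
# have a CUDA GPU, otherwise omit it for a CPU/Metal build).
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j

# Download a Dynamic 4-bit GGUF from Hugging Face.
# Repo name and filename pattern are assumptions, not quotes from the guide.
pip install -U huggingface_hub
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  --include "*UD-Q4_K_XL*" --local-dir ./models
```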
There is also a useful caveat for local stack selection. The guide says current Qwen3.5 GGUF builds do not work in Ollama because of separate mmproj vision files, and recommends llama.cpp-compatible backends instead. That makes the HN item valuable not because it announces a new model family, but because it turns local deployment into a checklist: pick a model size, match it to available memory, decide whether reasoning should be enabled, and use a backend that already handles the current GGUF packaging correctly.
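That packaging detail is visible at launch time: with a llama.cpp-compatible backend, the vision projector ships as its own file and is passed separately. A hedged sketch, with both filenames assumed:

```bash
# Serving a vision-capable GGUF: the mmproj projector is a separate file
# passed alongside the main model (both filenames are assumptions).
./llama.cpp/build/bin/llama-server \
  -m ./models/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --mmproj ./models/Qwen3.5-27B-mmproj-F16.gguf \
  -c 32768 --port 8080
```

This two-file layout is exactly what the guide says current Ollama builds cannot handle.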
Related Articles
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.
r/LocalLLaMA liked this comparison because it replaces reputation and anecdote with a measurable, distribution-based yardstick: the post ranks community Qwen3.5-9B GGUF quants by mean KLD against a BF16 baseline, with Q8_0 variants leading on fidelity and several IQ4/Q5 options standing out on size-to-drift trade-offs, while commenters push for better visual encoding, Gemma 4 runs, Thireus quants, and long-context testing.
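For readers new to that yardstick (the definition below is the standard one, not quoted from the post): mean KLD averages the Kullback-Leibler divergence between the BF16 baseline's next-token distribution $p$ and the quant's distribution $q$ over an evaluation set,

$$\mathrm{KLD}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i},$$

so lower values mean the quantized model's outputs drift less from the full-precision reference.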