Qwen 3.5 local guide maps out memory budgets, 256K context, and llama.cpp setup
Original: How to run Qwen 3.5 locally
The Hacker News submission "How to run Qwen 3.5 locally" is less interesting as a headline than as an operations document. The linked Unsloth guide focuses on how to run the Qwen3.5 family on real local hardware, covering 35B-A3B, 27B, 122B-A10B, 397B-A17B, and the smaller 0.8B, 2B, 4B, and 9B models in one place.
The most practical part of the guide is its hardware table. Unsloth lists approximate 4-bit memory requirements of 17 GB for 27B, 22 GB for 35B-A3B, 70 GB for 122B-A10B, and 214 GB for 397B-A17B. The document also says the family supports 256K context across 201 languages and frames the 27B and 35B-A3B models as realistic local options for roughly 22 GB-class unified-memory devices. Its guidance is straightforward: pick 27B if you want somewhat higher accuracy and are memory constrained, or 35B-A3B if you want faster inference.
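The memory figures above translate directly into a quick fit check. The following is a minimal Python sketch, not something taken from the guide: the table values are the approximate 4-bit numbers quoted above, while the function name `models_that_fit` and the `headroom_gb` parameter (an allowance for context and the operating system) are hypothetical names introduced for illustration.

```python
# Approximate 4-bit memory requirements in GB, as quoted from the Unsloth guide.
APPROX_4BIT_GB = {
    "27B": 17,
    "35B-A3B": 22,
    "122B-A10B": 70,
    "397B-A17B": 214,
}


def models_that_fit(available_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """Return the models whose approximate 4-bit footprint fits in memory.

    headroom_gb is a hypothetical allowance for the KV cache and the OS.
    The guide does not give a value for it, so it defaults to zero here.
    """
    budget = available_gb - headroom_gb
    return [name for name, need in APPROX_4BIT_GB.items() if need <= budget]


if __name__ == "__main__":
    for ram in (16, 24, 96, 256):
        fits = models_that_fit(ram)
        print(f"{ram} GB -> {fits if fits else 'none of the listed models'}")
```

Because the figures are approximate and long contexts add memory on top of the weights, a result from a check like this is best read as a first filter rather than a guarantee.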
What the guide actually gives developers
- Per-model memory budgets and quantization choices
- Recommended temperature, top-p, and top-k settings for thinking and non-thinking modes
- Explicit reasoning control through --chat-template-kwargs '{"enable_thinking":false}'
- llama.cpp build instructions and ready-to-run llama-cli commands
- Operational notes about refreshed GGUFs, quantization changes, and tool-calling fixes
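The reasoning switch and sampler settings in the list above can be assembled into a llama-cli invocation programmatically. This is a hedged Python sketch: the --chat-template-kwargs flag and its JSON payload come from the guide, and -m, --temp, --top-p, and --top-k are standard llama.cpp options, but the helper name `build_llama_cli_args` and the model path are hypothetical, and no sampler values are filled in because this summary does not quote the guide's recommended numbers.

```python
import json
import shlex


def build_llama_cli_args(model_path, enable_thinking, temp=None, top_p=None, top_k=None):
    """Build an argument list for llama-cli with explicit reasoning control.

    Sampler options are appended only when a value is supplied, so the
    recommended per-mode values from the guide can be passed in by the caller.
    """
    # Compact separators reproduce the payload form shown in the guide:
    # {"enable_thinking":false}
    kwargs_json = json.dumps({"enable_thinking": enable_thinking}, separators=(",", ":"))
    args = ["llama-cli", "-m", model_path, "--chat-template-kwargs", kwargs_json]
    if temp is not None:
        args += ["--temp", str(temp)]
    if top_p is not None:
        args += ["--top-p", str(top_p)]
    if top_k is not None:
        args += ["--top-k", str(top_k)]
    return args


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a file name from the guide.
    args = build_llama_cli_args("model.gguf", enable_thinking=False)
    # shlex.join quotes the JSON payload so the line can be pasted into a shell.
    print(shlex.join(args))
```

Keeping the arguments as a list rather than a single string avoids shell-quoting mistakes around the JSON payload when the command is later passed to a process launcher.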
That emphasis matters. Instead of treating Qwen3.5 as only a benchmark story, the page acts like a deployment cookbook. It walks through cloning and building llama.cpp, downloading GGUFs from Hugging Face, and launching Dynamic 4-bit variants for different use cases. The March 5 update note is also important: Unsloth says users should redownload several Qwen3.5 GGUFs because improved quantization, new imatrix data, and a chat-template fix affect chat, coding, long-context, and tool-calling performance.
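The clone-and-build step can be sketched as a small dry-run plan. This Python example assumes the generic CMake flow for llama.cpp and the public ggml-org repository URL; it does not reproduce the guide's exact commands, omits any GPU-specific build options, and leaves out the GGUF download because the repository names are not quoted here. The helper names `build_plan` and `run_plan` are hypothetical.

```python
import subprocess

# Public llama.cpp repository; the guide's exact clone command may differ.
LLAMA_CPP_REPO = "https://github.com/ggml-org/llama.cpp"


def build_plan(src_dir="llama.cpp", build_dir="build"):
    """Return the clone, configure, and build commands as argument lists."""
    out_dir = f"{src_dir}/{build_dir}"
    return [
        ["git", "clone", LLAMA_CPP_REPO, src_dir],
        ["cmake", "-S", src_dir, "-B", out_dir],
        ["cmake", "--build", out_dir, "--config", "Release"],
    ]


def run_plan(plan, dry_run=True):
    """Print each command and execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_plan(build_plan())  # dry run: prints the commands without executing
```

Defaulting to a dry run makes it easy to review the steps before committing to a long build.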
There is also a useful caveat for local stack selection. The guide says current Qwen3.5 GGUF builds do not work in Ollama because of separate mmproj vision files, and recommends llama.cpp-compatible backends instead. That makes the HN item valuable not because it announces a new model family, but because it turns local deployment into a checklist: pick a model size, match it to available memory, decide whether reasoning should be enabled, and use a backend that already handles the current GGUF packaging correctly.