Qwen 3.5 local guide maps out memory budgets, 256K context, and llama.cpp setup
Original: How to run Qwen 3.5 locally
The Hacker News submission "How to run Qwen 3.5 locally" is less interesting as a headline than as an operations document. The linked Unsloth guide focuses on how to run the Qwen3.5 family on real local hardware, covering 35B-A3B, 27B, 122B-A10B, 397B-A17B, and the smaller 0.8B, 2B, 4B, and 9B models in one place.
The most practical part of the guide is its hardware table. Unsloth lists approximate 4-bit memory requirements of 17 GB for 27B, 22 GB for 35B-A3B, 70 GB for 122B-A10B, and 214 GB for 397B-A17B. The document also says the family supports 256K context across 201 languages and frames the 27B and 35B-A3B models as realistic local options for roughly 22 GB-class unified-memory devices. Its guidance is straightforward: pick 27B if you want somewhat higher accuracy and are memory constrained, or 35B-A3B if you want faster inference.
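The listed figures can be sanity-checked against the parameter counts. A quick sketch (the bytes-per-parameter framing is our own, not something the guide publishes): 4-bit weights alone would be 0.5 bytes per parameter, so the implied overhead from quantization metadata, higher-precision embeddings, and runtime buffers is roughly 10–25%.

```python
# Bytes per parameter implied by the guide's approximate 4-bit figures.
# Pure 4-bit weights would be 0.5 bytes/param; the excess is metadata,
# embeddings kept at higher precision, and runtime overhead.
listed = {
    "27B":       (27e9,  17),   # (parameter count, listed GB)
    "35B-A3B":   (35e9,  22),
    "122B-A10B": (122e9, 70),
    "397B-A17B": (397e9, 214),
}

for name, (params, gb) in listed.items():
    print(f"{name}: {gb * 1e9 / params:.2f} bytes/param")
```

All four land in the 0.54–0.63 bytes/param band, which is consistent with 4-bit quantization plus modest overhead, and also shows why "parameters ÷ 2" alone underestimates the memory you actually need.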
What the guide actually gives developers
- Per-model memory budgets and quantization choices
- Recommended temperature, top-p, and top-k settings for thinking and non-thinking modes
- Explicit reasoning control through `--chat-template-kwargs '{"enable_thinking":false}'`
- llama.cpp build instructions and ready-to-run `llama-cli` commands
- Operational notes about refreshed GGUFs, quantization changes, and tool-calling fixes
That emphasis matters. Instead of treating Qwen3.5 as only a benchmark story, the page acts like a deployment cookbook. It walks through cloning and building llama.cpp, downloading GGUFs from Hugging Face, and launching Dynamic 4-bit variants for different use cases. The March 5 update note is also important: Unsloth says users should redownload several Qwen3.5 GGUFs because improved quantization, new imatrix data, and a chat-template fix affect chat, coding, long-context, and tool-calling performance.
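The end-to-end flow described above fits in a few commands. This is a sketch, not the guide's exact recipe: the Hugging Face repo name and context size below are illustrative placeholders, and build flags for GPU backends (CUDA, Metal, Vulkan) are omitted.

```shell
# Clone and build llama.cpp from source.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Download a 4-bit GGUF from Hugging Face and start an interactive chat
# with reasoning disabled via the chat template. The repo name is an
# illustrative placeholder, not a path taken from the guide.
./build/bin/llama-cli \
  -hf unsloth/Qwen3.5-27B-GGUF \
  --ctx-size 16384 \
  --chat-template-kwargs '{"enable_thinking":false}'
```

Dropping the `--chat-template-kwargs` flag (or setting `"enable_thinking":true`) switches back to thinking mode, which is the toggle the guide pairs with its per-mode sampling settings.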
There is also a useful caveat for local stack selection. The guide says current Qwen3.5 GGUF builds do not work in Ollama because of separate mmproj vision files, and recommends llama.cpp-compatible backends instead. That makes the HN item valuable not because it announces a new model family, but because it turns local deployment into a checklist: pick a model size, match it to available memory, decide whether reasoning should be enabled, and use a backend that already handles the current GGUF packaging correctly.
Related Articles
A high-scoring LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.
A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.
A r/LocalLLaMA thread is drawing attention to `llama.cpp` pull request #19504, which adds a `GATED_DELTA_NET` op for Qwen3Next-style models. Reddit users reported better token-generation speed after updating, while the PR itself includes early CPU/CUDA benchmark data.
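The micro-batch observation above is easy to check on your own hardware with llama.cpp's built-in benchmark tool, which reports prompt processing (pp) and token generation (tg) separately. The model path is an illustrative placeholder.

```shell
# Sweep micro-batch sizes and compare pp vs tg throughput per size.
# The GGUF filename below is a placeholder for whatever model you run.
./build/bin/llama-bench \
  -m Qwen3.5-27B-Q4_K_M.gguf \
  -p 2048 -n 128 \
  -ub 64,128,256,512
```

Because prompt ingestion and generation can respond very differently to `n_ubatch`, the useful output is the pair of columns per setting, not any single "best" number.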