Qwen 3.5 local guide maps out memory budgets, 256K context, and llama.cpp setup
Original: How to run Qwen 3.5 locally
The Hacker News submission "How to run Qwen 3.5 locally" is less interesting as a headline than as an operations document. The linked Unsloth guide focuses on how to run the Qwen3.5 family on real local hardware, covering 35B-A3B, 27B, 122B-A10B, 397B-A17B, and the smaller 0.8B, 2B, 4B, and 9B models in one place.
The most practical part of the guide is its hardware table. Unsloth lists approximate 4-bit memory requirements of 17 GB for 27B, 22 GB for 35B-A3B, 70 GB for 122B-A10B, and 214 GB for 397B-A17B. The document also says the family supports 256K context across 201 languages and frames the 27B and 35B-A3B models as realistic local options for roughly 22 GB-class unified-memory devices. Its guidance is straightforward: pick 27B if you want somewhat higher accuracy and are memory constrained, or 35B-A3B if you want faster inference.
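The listed figures can be sanity-checked against the parameter counts. A quick sketch (the bytes-per-parameter framing is our own, not something the guide publishes): 4-bit weights alone would be 0.5 bytes per parameter, so the implied overhead from quantization metadata, higher-precision embeddings, and runtime buffers is roughly 10–25%.

```python
# Bytes per parameter implied by the guide's approximate 4-bit figures.
# Pure 4-bit weights would be 0.5 bytes/param; the excess is metadata,
# embeddings kept at higher precision, and runtime overhead.
listed = {
    "27B":       (27e9,  17),   # (parameter count, listed GB)
    "35B-A3B":   (35e9,  22),
    "122B-A10B": (122e9, 70),
    "397B-A17B": (397e9, 214),
}

for name, (params, gb) in listed.items():
    print(f"{name}: {gb * 1e9 / params:.2f} bytes/param")
```

All four land in the 0.54–0.63 bytes/param band, which is consistent with 4-bit quantization plus modest overhead, and also shows why "parameters ÷ 2" alone underestimates the memory you actually need.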
What the guide actually gives developers
- Per-model memory budgets and quantization choices
- Recommended temperature, top-p, and top-k settings for thinking and non-thinking modes
- Explicit reasoning control through `--chat-template-kwargs '{"enable_thinking":false}'`
- llama.cpp build instructions and ready-to-run `llama-cli` commands
- Operational notes about refreshed GGUFs, quantization changes, and tool-calling fixes
That emphasis matters. Instead of treating Qwen3.5 as only a benchmark story, the page acts like a deployment cookbook. It walks through cloning and building llama.cpp, downloading GGUFs from Hugging Face, and launching Dynamic 4-bit variants for different use cases. The March 5 update note is also important: Unsloth says users should redownload several Qwen3.5 GGUFs because improved quantization, new imatrix data, and a chat-template fix affect chat, coding, long-context, and tool-calling performance.
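The end-to-end flow described above fits in a few commands. This is a sketch, not the guide's exact recipe: the Hugging Face repo name and context size below are illustrative placeholders, and build flags for GPU backends (CUDA, Metal, Vulkan) are omitted.

```shell
# Clone and build llama.cpp from source.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Download a 4-bit GGUF from Hugging Face and start an interactive chat
# with reasoning disabled via the chat template. The repo name is an
# illustrative placeholder, not a path taken from the guide.
./build/bin/llama-cli \
  -hf unsloth/Qwen3.5-27B-GGUF \
  --ctx-size 16384 \
  --chat-template-kwargs '{"enable_thinking":false}'
```

Dropping the `--chat-template-kwargs` flag (or setting `"enable_thinking":true`) switches back to thinking mode, which is the toggle the guide pairs with its per-mode sampling settings.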
There is also a useful caveat for local stack selection. The guide says current Qwen3.5 GGUF builds do not work in Ollama because of separate mmproj vision files, and recommends llama.cpp-compatible backends instead. That makes the HN item valuable not because it announces a new model family, but because it turns local deployment into a checklist: pick a model size, match it to available memory, decide whether reasoning should be enabled, and use a backend that already handles the current GGUF packaging correctly.
Related Articles
A high-scoring LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.
A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.
A r/LocalLLaMA thread is drawing attention to `llama.cpp` pull request #19504, which adds a `GATED_DELTA_NET` op for Qwen3Next-style models. Reddit users reported better token-generation speed after updating, while the PR itself includes early CPU/CUDA benchmark data.
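The micro-batch observation above is easy to check on your own hardware with llama.cpp's built-in benchmark tool, which reports prompt processing (pp) and token generation (tg) separately. The model path is an illustrative placeholder.

```shell
# Sweep micro-batch sizes and compare pp vs tg throughput per size.
# The GGUF filename below is a placeholder for whatever model you run.
./build/bin/llama-bench \
  -m Qwen3.5-27B-Q4_K_M.gguf \
  -p 2048 -n 128 \
  -ub 64,128,256,512
```

Because prompt ingestion and generation can respond very differently to `n_ubatch`, the useful output is the pair of columns per setting, not any single "best" number.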