Qwen 3.5 local guide maps out memory budgets, 256K context, and llama.cpp setup

Original: How to run Qwen 3.5 locally

LLM Mar 8, 2026 By Insights AI (HN) 2 min read

The Hacker News submission "How to run Qwen 3.5 locally" is less interesting as a headline than as an operations document. The linked Unsloth guide focuses on how to run the Qwen3.5 family on real local hardware, covering 35B-A3B, 27B, 122B-A10B, 397B-A17B, and the smaller 0.8B, 2B, 4B, and 9B models in one place.

The most practical part of the guide is its hardware table. Unsloth lists approximate 4-bit memory requirements of 17 GB for 27B, 22 GB for 35B-A3B, 70 GB for 122B-A10B, and 214 GB for 397B-A17B. The document also says the family supports 256K context across 201 languages and frames the 27B and 35B-A3B models as realistic local options for roughly 22 GB-class unified-memory devices. Its guidance is straightforward: pick 27B if you want somewhat higher accuracy and are memory constrained, or 35B-A3B if you want faster inference.
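Under those numbers, model selection is just a comparison of available memory against each 4-bit footprint. A minimal shell sketch using the guide's figures (the `MEM_GB` budget of 22 is an example value, not from the guide):

```shell
#!/bin/sh
# Match Qwen3.5 variants against a memory budget, using the
# approximate 4-bit footprints listed in the Unsloth guide.
MEM_GB=22   # example: a 22 GB-class unified-memory machine
FITS=""
for entry in "27B:17" "35B-A3B:22" "122B-A10B:70" "397B-A17B:214"; do
  model=${entry%%:*}   # name before the colon
  need=${entry##*:}    # GB after the colon
  if [ "$MEM_GB" -ge "$need" ]; then
    FITS="$FITS $model"
    echo "$model fits: ~${need} GB at 4-bit"
  else
    echo "$model needs ~${need} GB -- over budget"
  fi
done
echo "Local candidates:$FITS"
```

At a 22 GB budget this selects 27B and 35B-A3B, matching the guide's framing of those two as the realistic local options.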

What the guide actually gives developers

  • Per-model memory budgets and quantization choices
  • Recommended temperature, top-p, and top-k settings for thinking and non-thinking modes
  • Explicit reasoning control through --chat-template-kwargs '{"enable_thinking":false}'
  • llama.cpp build instructions and ready-to-run llama-cli commands
  • Operational notes about refreshed GGUFs, quantization changes, and tool-calling fixes
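The reasoning toggle from the list above plugs directly into a `llama-cli` invocation. A hedged sketch: the GGUF filename is a placeholder, and the sampling values here are illustrative rather than the guide's exact per-mode recommendations, which you should look up before running:

```shell
# Launch a local Qwen3.5 chat with reasoning disabled via the chat template.
# Model filename is a placeholder; temp/top-p/top-k are illustrative --
# use the per-mode values the Unsloth guide actually recommends.
llama-cli \
  -m Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --temp 0.7 --top-p 0.8 --top-k 20 \
  --ctx-size 32768 \
  --chat-template-kwargs '{"enable_thinking":false}' \
  -p "Summarize the tradeoffs between the 27B and 35B-A3B variants."
```

Dropping the `--chat-template-kwargs` flag (or setting `enable_thinking` to `true`) re-enables the thinking mode, at which point the thinking-mode sampling settings apply instead.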

That emphasis matters. Instead of treating Qwen3.5 as only a benchmark story, the page acts like a deployment cookbook. It walks through cloning and building llama.cpp, downloading GGUFs from Hugging Face, and launching Dynamic 4-bit variants for different use cases. The March 5 update note is also important: Unsloth says users should redownload several Qwen3.5 GGUFs because improved quantization, new imatrix data, and a chat-template fix affect chat, coding, long-context, and tool-calling performance.
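Those cookbook steps reduce to a short command sequence. A sketch under stated assumptions: the Hugging Face repo name and quant filename pattern below follow Unsloth's usual naming conventions but are not copied from the guide, so verify the exact paths there first:

```shell
# Build llama.cpp from source (plain CPU build shown;
# add the appropriate backend flag for your GPU).
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF
cmake --build llama.cpp/build --config Release -j --target llama-cli

# Fetch a Dynamic 4-bit GGUF. Repo and include pattern are assumptions
# based on Unsloth's usual naming -- confirm them in the guide.
pip install -U "huggingface_hub[cli]"
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  --include "*UD-Q4_K_XL*" --local-dir models/

# Run the downloaded file.
./llama.cpp/build/bin/llama-cli -m models/<downloaded-file>.gguf
```

Because of the March 5 update note, it is worth re-running the download step even if you already have the files: the refreshed GGUFs carry the improved quantization and chat-template fix.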

There is also a useful caveat for local stack selection. The guide says current Qwen3.5 GGUF builds do not work in Ollama because of separate mmproj vision files, and recommends llama.cpp-compatible backends instead. That makes the HN item valuable not because it announces a new model family, but because it turns local deployment into a checklist: pick a model size, match it to available memory, decide whether reasoning should be enabled, and use a backend that already handles the current GGUF packaging correctly.
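Since Ollama cannot currently load these GGUFs, the checklist ends at a llama.cpp-compatible backend. A minimal sketch of serving the model with `llama-server`, with the model path again a placeholder:

```shell
# Serve an OpenAI-compatible endpoint from llama.cpp instead of Ollama.
# Model filename is a placeholder; the separate mmproj file is only
# needed if you want vision input.
llama-server \
  -m models/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --port 8080 --ctx-size 32768

# Then point any OpenAI-style client at the local endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hello"}]}'
```

This keeps the rest of a local stack (editors, agents, chat UIs) unchanged, since anything that speaks the OpenAI chat API can target the `llama-server` endpoint.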

