r/LocalLLaMA Focuses on a Qwen3.5-27B + llama.cpp + OpenCode Stack That Actually Works
Original: Running Qwen3.5-27B locally as the primary model in OpenCode
A practical local coding stack
A March 2026 r/LocalLLaMA post pushed a detailed deployment guide for using Qwen3.5-27B as the primary model in OpenCode, reaching 126 points and 45 comments at crawl time. The setup is concrete: an RTX 4090 workstation runs a quantized Qwen3.5-27B GGUF through llama.cpp, a MacBook acts as the client, and Tailscale exposes the model over a private network. The guide also explicitly targets agentic coding use with OpenCode and Codex, which is why it resonated with the community.
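The network side of that stack can be sketched as follows. This is an illustration, not the author's exact commands: the port, model filename, and the OpenAI-style client variable are assumptions, while `tailscale ip -4` and llama-server's `--host`/`--port` flags and `/v1` endpoint are standard.

```shell
# On the RTX 4090 workstation: find this machine's Tailscale IPv4 address
TS_IP=$(tailscale ip -4)

# Bind llama-server to the tailnet address so only tailnet peers can reach it
# (model filename and port are placeholders)
./build/bin/llama-server -m Qwen3.5-27B-Q4_K_M.gguf --host "$TS_IP" --port 8080

# On the MacBook client: point any OpenAI-compatible tool at the workstation
export OPENAI_BASE_URL="http://<workstation-tailscale-ip>:8080/v1"
```

Binding to the Tailscale address rather than 0.0.0.0 is what keeps the model off the public network without extra firewall rules.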
The details that usually break local setups
The post is valuable because it focuses on the failure points that generic run-this-model-locally guides often skip. It recommends building llama.cpp with CUDA support, downloading the unsloth/Qwen3.5-27B-GGUF weights plus the mmproj-F16 file, and testing llama-server locally before binding it to a Tailscale address. More importantly, it calls out a corrected Jinja chat template to fix system-message ordering problems that can break tool use in OpenCode and Codex.
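The build-download-smoke-test sequence looks roughly like this. A hedged sketch: the quant choice and file names are placeholders, and exact cmake options can vary across llama.cpp versions.

```shell
# 1. Build llama.cpp with CUDA support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# 2. Fetch the quantized weights (quant selection here is a placeholder),
#    plus the mmproj-F16 file from the same repo
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  --include "*Q4_K_M*" --local-dir models

# 3. Smoke-test on localhost before binding to a Tailscale address
./build/bin/llama-server -m models/Qwen3.5-27B-Q4_K_M.gguf \
  --host 127.0.0.1 --port 8080
curl http://127.0.0.1:8080/health
```

Testing against 127.0.0.1 first separates model/template problems from network problems, which is the ordering the guide recommends.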
- ctx-size 65536 replaces the 262,144-token context advertised in the model metadata, which would OOM a 24 GB card
- parallel 1 is recommended because every additional slot reserves its own full KV cache
- cache-type-k bf16, cache-type-v bf16, and flash attention keep KV-cache VRAM usage manageable
- the author reports roughly 22 GB VRAM usage at 65,536 context on an RTX 4090
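Assembled into one invocation, the flags above would look something like this. A sketch, not the author's verbatim command: the quant filename is a placeholder, and the exact flash-attention flag spelling varies across llama-server releases.

```shell
./build/bin/llama-server \
  -m models/Qwen3.5-27B-Q4_K_M.gguf \
  --ctx-size 65536 \
  --parallel 1 \
  --cache-type-k bf16 \
  --cache-type-v bf16 \
  --flash-attn \
  --host "$(tailscale ip -4)" \
  --port 8080
```

The key interaction to notice: KV-cache VRAM scales with ctx-size times parallel slots, so the 65,536-token context only fits in ~22 GB because parallel is pinned to 1 and both cache types are kept at bf16.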
Why the guide matters
The tutorial also explains less obvious runtime tradeoffs. The ubatch (physical batch) size mainly governs VRAM spikes during prompt ingestion, context-shift can silently trim early instructions once the context fills up, and overriding the embedded chat template means future GGUF template fixes will no longer arrive automatically. Those are precisely the sorts of operational details that decide whether a local LLM setup is a demo or a real daily tool.
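These knobs map onto llama-server options roughly as follows. Flag names reflect recent llama-server releases and may change; the template filename is a placeholder standing in for the corrected Jinja template the guide supplies.

```shell
# --jinja + --chat-template-file: override the embedded chat template with the
#   corrected one (after this, GGUF-embedded template fixes no longer apply)
# --no-context-shift: fail loudly instead of silently dropping early context
# --ubatch-size: cap the physical batch to limit prompt-ingestion VRAM spikes
./build/bin/llama-server \
  -m models/Qwen3.5-27B-Q4_K_M.gguf \
  --jinja \
  --chat-template-file corrected-qwen.jinja \
  --no-context-shift \
  --ubatch-size 512
```

Disabling context-shift trades silent instruction loss for a hard error, which is usually the right trade for agentic use where the system prompt must survive intact.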
That is why the LocalLLaMA reaction matters. The community is no longer satisfied with raw benchmark talk or simple it-runs-on-my-machine posts. What readers want is reliable, reproducible guidance for turning open models into usable coding infrastructure. This guide fits that demand because it connects model choice, network exposure, template correction, and VRAM management into one workflow. In practice, that is the difference between having a local model available and having a local model that an agent can actually use productively.
Primary source: Aayush Garg’s guide. Community discussion: r/LocalLLaMA.