LocalLLaMA Says a Qwen 3.5 Chat Template Bug Is Quietly Killing Prefix-Cache Reuse
Original: I tracked a major cache reuse issue down to Qwen 3.5’s chat template
A practical r/LocalLLaMA debugging post is getting attention because it points to a surprisingly mundane bottleneck in local agent workflows: not the model weights, not the inference engine, but the chat template. The author says they spent a week investigating cache misses on an M5 Max while using Qwen 3.5 with oMLX.ai and agent tools such as OpenCode.ai and Pi.dev, and saw the same behavior across other backends including llama.cpp.
The failure pattern was specific. After a long turn with tool calls, a simple follow-up question would cause a large chunk of earlier context to be reprocessed instead of reused from the prefix cache. According to the post, the root cause was that the shipped Qwen 3.5 chat template emitted empty historical <think>...</think> blocks for prior assistant turns whose reasoning_content was empty. That changed the serialized prompt for what was logically the same conversation history, which in turn broke prefix-cache matching and forced avoidable recomputation.
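The mechanism is easy to demonstrate in miniature. The sketch below uses a toy serializer (not Qwen's actual template) to render the same logical history two ways, once with empty <think></think> wrappers on past assistant turns and once without. A prefix cache can only reuse the longest shared prefix of the serialized prompt, so the moment the two renderings diverge, everything after that point must be recomputed.

```python
# Toy illustration: how an empty <think></think> wrapper changes the
# serialized prompt and shrinks the reusable cache prefix.
# (Assumption: the role markers and layout here are illustrative only.)

def render(history, emit_empty_think):
    parts = []
    for role, content, reasoning in history:
        if role == "assistant" and (reasoning or emit_empty_think):
            parts.append(f"<think>{reasoning}</think>")
        parts.append(f"<|{role}|>{content}")
    return "".join(parts)

history = [
    ("user", "Run the tests", ""),
    ("assistant", "All tests pass.", ""),  # reasoning_content is empty
    ("user", "Now lint the code", ""),
]

with_wrapper = render(history, emit_empty_think=True)
without_wrapper = render(history, emit_empty_think=False)

def common_prefix_len(a, b):
    # Length of the longest shared prefix, i.e. what a prefix cache can reuse.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# The two serializations diverge at the first assistant turn, so all
# later tokens miss the cache even though the history is logically the same.
print(common_prefix_len(with_wrapper, without_wrapper), len(without_wrapper))
```

In a real session every tool-heavy turn sits in that post-divergence region, which is why the reprocessed chunk grows with conversation length.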
The proposed fix is a one-line guard in the Jinja template so the historical wrapper is emitted only when reasoning text is actually present. The Reddit post links to upstream discussions, and the broader issue is already visible outside Reddit: a Hugging Face discussion for Qwen3.5-122B-A10B is titled “fix empty historical <think> blocks in chat_template.jinja,” while GitHub issue #1826 describes the same class of KV-cache breakage when thinking is disabled. Commenters on the Reddit thread reported seeing similar repeated prefills in LM Studio and other stacks, and at least one early tester said follow-up turns felt much faster after applying the change.
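The intent of the one-line guard can be sketched as follows. This is Python mirroring the logic, not the upstream Jinja code, and the function name and markers are hypothetical: the wrapper is emitted only when there is actual reasoning text, so a turn with empty reasoning_content serializes identically to one with the field absent.

```python
# Sketch of the guarded emission (assumption: this mirrors the intent of
# the one-line Jinja fix; the real template differs in detail).

def render_assistant_turn(content, reasoning_content=""):
    parts = []
    # Guard: only emit the historical wrapper when reasoning text exists,
    # so an empty reasoning_content no longer perturbs the serialized history.
    if reasoning_content:
        parts.append(f"<think>{reasoning_content}</think>")
    parts.append(content)
    return "".join(parts)

# Same logical turn, byte-identical output with or without the empty field,
# which keeps the conversation prefix stable across re-serializations.
print(render_assistant_turn("All tests pass."))
print(render_assistant_turn("All tests pass.", reasoning_content=""))
```

The design point is that cache friendliness requires serialization to be a pure function of the logical conversation state: two histories that mean the same thing must render to the same bytes.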
Why this resonated
The post landed because it exposes how much agent performance now depends on serialization details around tool use, not just on model quality. In long-running local coding workflows, a template bug that invalidates prefix caching can waste tens of thousands of tokens per turn and make a capable model feel broken. That is exactly the kind of issue practitioners care about. A small formatting mistake can erase the practical benefit of larger context windows, faster quantized inference, and better tool calling. The lesson from the thread is straightforward: if Qwen 3.5 sessions keep reparsing old context after tool-heavy exchanges, check the chat template before blaming the cache layer.
Related Articles
A practical HN gist lays out how to run Ollama and Gemma 4 on an Apple Silicon Mac mini, including auto-start, periodic preload, and `OLLAMA_KEEP_ALIVE=-1`. The author says `gemma4:26b` nearly exhausted 24GB unified memory, making the default 8B model a safer operational choice.
A strong r/LocalLLaMA reaction suggests PrismML’s Bonsai launch is landing as more than another compression headline. The discussion combines the company’s end-to-end 1-bit claims with early hands-on reports that the models feel materially more usable than earlier BitNet-style experiments.
A recent LocalLLaMA discussion shared results from Mac LLM Bench, an open benchmark workflow for Apple Silicon systems. The most useful takeaway is practical: dense 32B models hit a clear wall on a 32 GB MacBook Air M5, while some MoE models offer a much better latency-to-capability tradeoff.