Skip to content
Decaying

LocalLLaMA Says a Qwen 3.5 Chat Template Bug Is Quietly Killing Prefix-Cache Reuse

Original: I tracked a major cache reuse issue down to Qwen 3.5’s chat template View original →

Read in other languages: 한국어日本語
LLM Apr 9, 2026 By Insights AI (Reddit) 2 min read 30 views Source

A practical r/LocalLLaMA debugging post is getting attention because it points to a surprisingly mundane bottleneck in local agent workflows: not the model weights, not the inference engine, but the chat template. The author says they spent a week investigating cache misses on an M5 Max while using Qwen 3.5 with oMLX.ai and agent tools such as OpenCode.ai and Pi.dev, and saw the same behavior across other backends including llama.cpp.

The failure pattern was specific. After a long turn with tool calls, a simple follow-up question would cause a large chunk of earlier context to be reprocessed instead of reused from the prefix cache. According to the post, the root cause was that the shipped Qwen 3.5 chat template could emit empty historical <think>...</think> blocks for prior assistant turns even when reasoning_content was empty. That changed the serialized prompt for what was logically the same conversation history, which in turn broke cache-prefix matching and forced avoidable recomputation.

The proposed fix is a one-line guard in the Jinja template so the historical wrapper is emitted only when reasoning text is actually present. The Reddit post links to upstream discussions, and the broader issue is already visible outside Reddit: a Hugging Face discussion for Qwen3.5-122B-A10B is titled “fix empty historical <think> blocks in chat_template.jinja,” while GitHub issue #1826 describes the same class of KV-cache breakage when thinking is disabled. Commenters on the Reddit thread reported seeing similar repeated prefills in LM Studio and other stacks, and at least one early tester said follow-up turns felt much faster after applying the change.

Why this resonated

The post landed because it exposes how much agent performance now depends on serialization details around tool use, not just on model quality. In long-running local coding workflows, a template bug that invalidates prefix caching can waste tens of thousands of tokens per turn and make a capable model feel broken. That is exactly the kind of issue practitioners care about. A small formatting mistake can erase the practical benefit of larger context windows, faster quantized inference, and better tool calling. The lesson from the thread is straightforward: if Qwen 3.5 sessions keep reparsing old context after tool-heavy exchanges, check the chat template before blaming the cache layer.

Share: Long

Related Articles