LocalLLaMA、Qwen 3.5 chat template bugがprefix-cache reuseを静かに壊すと指摘

実務的な匂いの強い r/LocalLLaMA の debugging post が注目されている理由は、local agent workflow を遅くしている原因が model weights でも inference engine でもなく、chat template かもしれないと示したからだ。投稿者は M5 Max 上で Qwen 3.5 と oMLX.ai、OpenCode.ai、Pi.dev の組み合わせを最適化する過程で cache miss を追跡し、同様の挙動を llama.cpp など別の backend でも再現したと書いている。

問題のパターンはかなり具体的だ。tool call が続いた長い turn のあと、単純な follow-up question を投げると、prefix cache を再利用せず古い context の大きな部分を再処理してしまうという。投稿によれば原因は、shipped Qwen 3.5 chat template が historical assistant turn に対して reasoning_content が空でも empty <think>...</think> block を出力しうることにある。すると意味的には同じ conversation history でも serialized prompt が変わり、結果として cache prefix matching が壊れて avoidable な recomputation が発生する。

提案されている fix は Jinja template の one-line guard だ。reasoning text が実際に存在するときだけ historical wrapper を出すようにする。Reddit post は upstream discussion にもリンクしており、この問題はすでに Reddit の外でも見えている。Hugging Face discussion の題名は “fix empty historical <think> blocks in chat_template.jinja” で、GitHub issue #1826 も thinking を無効化した際の同種の KV-cache breakage を説明している。thread の commenters も LM Studio や他の stack で似た repeated prefill を見たと述べており、少なくとも一人の early tester は patch 後に follow-up turn がかなり速くなったと書いている。

なぜこの投稿が響いたのか

この post が響いたのは、agent performance がいまや model quality だけでなく tool use 周辺の serialization detail にどれほど左右されるかを露わにしたからだ。長い local coding session では、prefix caching を無効にする template bug 一つで turn ごとに何万 token も無駄になり、十分に強い model でも体感的には壊れて見える。practitioners が知りたいのはまさにこういう話だ。より大きい context window や高速な quantized inference があっても、format bug 一つでその利点は消える。この thread の教訓は単純だ。tool-heavy exchange のあとに Qwen 3.5 が古い context を繰り返し読み直すなら、cache layer を疑う前に chat template を確認すべきだ。

LocalLLaMA、Qwen 3.5 chat template bugがprefix-cache reuseを静かに壊すと指摘

なぜこの投稿が響いたのか

Related Articles

TextGenがネイティブデスクトップアプリに進化——LM Studioのオープンソース対抗馬として再出発

製造終了のIntel OptaneメモリでローカルLLM(1兆パラメータ)を毎秒4トークンで動作

Nemotron 3 Ultra、550B MoEでエージェント推論5倍と30%コスト削減を提示