LocalLLaMA flags a merged llama.cpp update for Qwen-family inference

Original: update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next

LLM · Mar 8, 2026 · By Insights AI (Reddit) · 2 min read

The LocalLLaMA thread titled "update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next" is a good example of why local model users track runtime projects as closely as model releases. The post points readers to llama.cpp pull request #19504 and says users saw noticeable token-generation improvements for Qwen3.5 and Qwen-Next after updating. The author shared benchmark screenshots and suggested the effect is mainly on CUDA and CPU paths.

The PR itself explains what changed. According to the GitHub description, the update adds a CPU/CUDA implementation of the GATED_DELTA_NET operation used by qwen3next and other recent attention variants. The author describes the current code as a basic vector (reference) implementation rather than the final chunked one, but notes that it already produces correct results and establishes the operator in the inference graph. In other words, this is not merely a small optimization pass: it is part of the compatibility work needed for newer Qwen-family architectures to run correctly in local runtimes.
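To make the operator concrete, the recurrence behind a gated delta rule can be sketched in a few lines. The snippet below is an illustrative NumPy reference implementation based on the Gated DeltaNet recurrence described in the literature, not the PR's actual CPU/CUDA kernel; the function name, signature, and tensor layout are my own assumptions.

```python
import numpy as np

def gated_delta_net(q, k, v, alpha, beta):
    """Naive (non-chunked) reference recurrence for a gated delta rule.

    q, k, v: (T, d) query/key/value sequences; alpha, beta: (T,) per-step gates.
    State update:  S_t = alpha_t * S_{t-1} @ (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T
    Output:        o_t = S_t @ q_t
    """
    T, d = q.shape
    S = np.zeros((d, d))          # recurrent state, carried across timesteps
    out = np.zeros((T, d))
    I = np.eye(d)
    for t in range(T):
        kt, vt = k[t], v[t]
        # Gated "erase" of the old association along k_t, then write v_t k_t^T.
        S = alpha[t] * S @ (I - beta[t] * np.outer(kt, kt)) + beta[t] * np.outer(vt, kt)
        out[t] = S @ q[t]
    return out
```

This mirrors what a "basic vector" implementation does: a sequential loop over timesteps, which is correct but slow; the chunked version the PR author mentions as future work would process blocks of timesteps with matrix multiplies instead.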

What the PR and thread together show

  • GATED_DELTA_NET support has now landed in llama.cpp.
  • The pull request was merged on March 7, 2026.
  • The PR includes example benchmarks for qwen3next and qwen35moe workloads.
  • LocalLLaMA users are already translating that upstream change into practical update advice.

The benchmark numbers in the PR are not universal, but they are useful context. The author includes CPU examples showing tg32 (32-token generation) throughput of 4.77 t/s for qwen3next 80B-A3B Q2_K and 11.08 t/s for qwen35moe Q4_K, alongside graph-node changes tied to the new operator. Those figures should be read as reference data rather than a promise for every workstation. The more important point is that the runtime now has explicit support for the relevant operation, and community users are reporting visible gains once they pull the latest build.

That is the real lesson from the thread. In local inference, the weight file is only half the story. New attention designs often depend on backend support before they become truly usable. LocalLLaMA is acting as an operational early-warning system here, taking an upstream merge and translating it into a simple recommendation: if you are testing recent Qwen-family models, update llama.cpp before you judge the model. For practitioners, that kind of community filtering is often more useful than another raw benchmark chart.
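The thread's recommendation amounts to rebuilding from the latest source and re-measuring generation throughput before judging the model. A minimal sketch, assuming a local llama.cpp checkout and its bundled `llama-bench` tool; the model path is a placeholder:

```shell
# Pull the latest llama.cpp and rebuild from source.
git pull origin master
cmake -B build -DGGML_CUDA=ON   # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j

# Re-measure token generation (tg32 = 32 generated tokens, no prompt eval),
# matching the style of the numbers quoted in the PR.
./build/bin/llama-bench -m /path/to/model.gguf -p 0 -n 32
```

Running the same `llama-bench` invocation before and after the update is the cleanest way to see whether the merge changes tg throughput on a given machine.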



