LocalLLaMA flags a merged llama.cpp update for Qwen-family inference
Original: update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next
The LocalLLaMA thread titled "update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next" is a good example of why local model users track runtime projects as closely as model releases. The post points readers to llama.cpp pull request #19504 and says users saw noticeable token-generation improvements for Qwen3.5 and Qwen-Next after updating. The author shared benchmark screenshots and suggested the effect is mainly on CUDA and CPU paths.
The PR itself explains what changed. According to the GitHub description, the update adds a CPU/CUDA implementation of the GATED_DELTA_NET operation used in qwen3next and many newer attention-model variants. The author says the current code is a basic vector or reference implementation rather than the final chunked implementation, but that it already produces correct results and establishes the operator in the inference graph. In other words, this is not only a small optimization pass. It is part of the compatibility work needed for newer Qwen-family architectures to perform correctly in local runtimes.
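To make the operator concrete: a gated delta net maintains a matrix-valued recurrent state that is decayed by a gate and updated with a delta-rule correction each token. The sketch below is a minimal NumPy reference of that recurrence, written from the published gated-delta-rule formulation; it is an illustration of the math, not llama.cpp's actual CPU/CUDA kernel, and the function name and gate conventions are this example's own.

```python
# Hypothetical reference sketch of the gated delta rule (not llama.cpp code).
# Per-token state update:  S_t = a_t * S_{t-1} @ (I - b_t * k_t k_t^T) + b_t * v_t k_t^T
# Per-token output:        o_t = S_t @ q_t
import numpy as np

def gated_delta_net(q, k, v, alpha, beta):
    """q, k, v: (T, d) arrays; alpha, beta: (T,) gates in (0, 1]."""
    T, d = q.shape
    S = np.zeros((d, d))              # recurrent state matrix
    out = np.empty((T, d))
    for t in range(T):
        kt = k[t]
        # decay the state, apply the delta-rule erase along k_t, then write v_t
        S = alpha[t] * (S - beta[t] * np.outer(S @ kt, kt)) + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

A chunked implementation, which the PR author says is still to come, would compute the same recurrence blockwise for better hardware utilization; the sequential loop above is the "basic vector" form the PR describes landing first.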
What the PR and thread together show
- GATED_DELTA_NET support has now landed in llama.cpp.
- The pull request was merged on March 7, 2026.
- The PR includes example benchmarks for qwen3next and qwen35moe workloads.
- LocalLLaMA users are already translating that upstream change into practical update advice.
The benchmark numbers in the PR are not universal, but they are still useful context. The author includes CPU examples showing tg32 results of 4.77 t/s for qwen3next 80B-A3B Q2_K and 11.08 t/s for qwen35moe Q4_K, alongside graph-node changes tied to the new operator. Those figures should be read as reference data rather than a promise for every workstation. The more important point is that the runtime now has explicit support for the relevant operation, and community users are reporting visible gains once they pull the latest build.
That is the real lesson from the thread. In local inference, the weight file is only half the story. New attention designs often depend on backend support before they become truly usable. LocalLLaMA is acting as an operational early-warning system here, taking an upstream merge and translating it into a simple recommendation: if you are testing recent Qwen-family models, update llama.cpp before you judge the model. For practitioners, that kind of community filtering is often more useful than another raw benchmark chart.
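The thread's recommendation translates into a short routine: pull the latest source, rebuild, and re-run a tg32 measurement before drawing conclusions. The commands below are one common way to do that with a source checkout; the directory layout and model path are placeholders, the CUDA flag is optional, and `-n 32` is what produces the tg32 metric the PR quotes.

```shell
# Refresh a local llama.cpp source build (drop -DGGML_CUDA=ON for CPU-only)
git -C llama.cpp pull
cmake -B llama.cpp/build llama.cpp -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j

# Re-measure token generation (tg32) on your own model file;
# -p 0 skips the prompt-processing pass
./llama.cpp/build/bin/llama-bench -m /path/to/model.gguf -p 0 -n 32
```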
Related Articles
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.
LocalLLaMA upvoted the merge because it is immediately testable, but the useful caveat was clear: speedups depend heavily on prompt repetition and draft acceptance.