LocalLLaMA flags a merged llama.cpp update for Qwen-family inference
Original: update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next
The LocalLLaMA thread titled "update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next" is a good example of why local model users track runtime projects as closely as model releases. The post points readers to llama.cpp pull request #19504 and says users saw noticeable token-generation improvements for Qwen3.5 and Qwen-Next after updating. The author shared benchmark screenshots and suggested the effect is mainly on CUDA and CPU paths.
The PR itself explains what changed. According to the GitHub description, the update adds a CPU/CUDA implementation of the GATED_DELTA_NET operation used in qwen3next and many newer attention-model variants. The author says the current code is a basic vector or reference implementation rather than the final chunked implementation, but that it already produces correct results and establishes the operator in the inference graph. In other words, this is not only a small optimization pass. It is part of the compatibility work needed for newer Qwen-family architectures to perform correctly in local runtimes.
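To make the operator concrete: a gated delta net maintains a matrix-valued recurrent state that is decayed by a gate and updated with a delta-rule correction each token. The sketch below is a minimal NumPy reference of that recurrence, written from the published gated-delta-rule formulation; it is an illustration of the math, not llama.cpp's actual CPU/CUDA kernel, and the function name and gate conventions are this example's own.

```python
# Hypothetical reference sketch of the gated delta rule (not llama.cpp code).
# Per-token state update:  S_t = a_t * S_{t-1} @ (I - b_t * k_t k_t^T) + b_t * v_t k_t^T
# Per-token output:        o_t = S_t @ q_t
import numpy as np

def gated_delta_net(q, k, v, alpha, beta):
    """q, k, v: (T, d) arrays; alpha, beta: (T,) gates in (0, 1]."""
    T, d = q.shape
    S = np.zeros((d, d))              # recurrent state matrix
    out = np.empty((T, d))
    for t in range(T):
        kt = k[t]
        # decay the state, apply the delta-rule erase along k_t, then write v_t
        S = alpha[t] * (S - beta[t] * np.outer(S @ kt, kt)) + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

A chunked implementation, which the PR author says is still to come, would compute the same recurrence blockwise for better hardware utilization; the sequential loop above is the "basic vector" form the PR describes landing first.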
What the PR and thread together show
- GATED_DELTA_NET support has now landed in llama.cpp.
- The pull request was merged on March 7, 2026.
- The PR includes example benchmarks for qwen3next and qwen35moe workloads.
- LocalLLaMA users are already translating that upstream change into practical update advice.
The benchmark numbers in the PR are not universal, but they are still useful context. The author includes CPU examples showing tg32 results of 4.77 t/s for qwen3next 80B-A3B Q2_K and 11.08 t/s for qwen35moe Q4_K, alongside graph-node changes tied to the new operator. Those figures should be read as reference data rather than a promise for every workstation. The more important point is that the runtime now has explicit support for the relevant operation, and community users are reporting visible gains once they pull the latest build.
That is the real lesson from the thread. In local inference, the weight file is only half the story. New attention designs often depend on backend support before they become truly usable. LocalLLaMA is acting as an operational early-warning system here, taking an upstream merge and translating it into a simple recommendation: if you are testing recent Qwen-family models, update llama.cpp before you judge the model. For practitioners, that kind of community filtering is often more useful than another raw benchmark chart.
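The thread's recommendation translates into a short routine: pull the latest source, rebuild, and re-run a tg32 measurement before drawing conclusions. The commands below are one common way to do that with a source checkout; the directory layout and model path are placeholders, the CUDA flag is optional, and `-n 32` is what produces the tg32 metric the PR quotes.

```shell
# Refresh a local llama.cpp source build (drop -DGGML_CUDA=ON for CPU-only)
git -C llama.cpp pull
cmake -B llama.cpp/build llama.cpp -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j

# Re-measure token generation (tg32) on your own model file;
# -p 0 skips the prompt-processing pass
./llama.cpp/build/bin/llama-bench -m /path/to/model.gguf -p 0 -n 32
```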
Related Articles
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.
LocalLLaMA upvoted the merge because it is immediately testable, but the useful caveat was clear: speedups depend heavily on prompt repetition and draft acceptance.