LocalLLaMA flags a merged llama.cpp update for Qwen-family inference

Original: update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next

LLM · Mar 8, 2026 · By Insights AI (Reddit) · 2 min read

The LocalLLaMA thread titled "update your llama.cpp - great tg speedup on Qwen3.5 / Qwen-Next" is a good example of why local model users track runtime projects as closely as model releases. The post points readers to llama.cpp pull request #19504 and says users saw noticeable token-generation improvements for Qwen3.5 and Qwen-Next after updating. The author shared benchmark screenshots and suggested the effect is mainly on CUDA and CPU paths.

The PR itself explains what changed. According to the GitHub description, the update adds a CPU/CUDA implementation of the GATED_DELTA_NET operation used by qwen3next and other recent attention variants. The author describes the current code as a basic vector (reference) implementation rather than the final chunked one, but notes that it already produces correct results and establishes the operator in the inference graph. In other words, this is not merely a small optimization pass: it is part of the compatibility work needed for newer Qwen-family architectures to run correctly in local runtimes.
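To make the operator concrete, the recurrence behind a gated delta rule can be sketched in a few lines. The snippet below is an illustrative NumPy reference implementation based on the Gated DeltaNet recurrence described in the literature, not the PR's actual CPU/CUDA kernel; the function name, signature, and tensor layout are my own assumptions.

```python
import numpy as np

def gated_delta_net(q, k, v, alpha, beta):
    """Naive (non-chunked) reference recurrence for a gated delta rule.

    q, k, v: (T, d) query/key/value sequences; alpha, beta: (T,) per-step gates.
    State update:  S_t = alpha_t * S_{t-1} @ (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T
    Output:        o_t = S_t @ q_t
    """
    T, d = q.shape
    S = np.zeros((d, d))          # recurrent state, carried across timesteps
    out = np.zeros((T, d))
    I = np.eye(d)
    for t in range(T):
        kt, vt = k[t], v[t]
        # Gated "erase" of the old association along k_t, then write v_t k_t^T.
        S = alpha[t] * S @ (I - beta[t] * np.outer(kt, kt)) + beta[t] * np.outer(vt, kt)
        out[t] = S @ q[t]
    return out
```

This mirrors what a "basic vector" implementation does: a sequential loop over timesteps, which is correct but slow; the chunked version the PR author mentions as future work would process blocks of timesteps with matrix multiplies instead.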

What the PR and thread together show

  • GATED_DELTA_NET support has now landed in llama.cpp.
  • The pull request was merged on March 7, 2026.
  • The PR includes example benchmarks for qwen3next and qwen35moe workloads.
  • LocalLLaMA users are already translating that upstream change into practical update advice.

The benchmark numbers in the PR are not universal, but they are useful context. The author includes CPU examples showing tg32 (32-token generation) throughput of 4.77 t/s for qwen3next 80B-A3B Q2_K and 11.08 t/s for qwen35moe Q4_K, alongside graph-node changes tied to the new operator. Those figures should be read as reference data rather than a promise for every workstation. The more important point is that the runtime now has explicit support for the relevant operation, and community users are reporting visible gains once they pull the latest build.

That is the real lesson from the thread. In local inference, the weight file is only half the story. New attention designs often depend on backend support before they become truly usable. LocalLLaMA is acting as an operational early-warning system here, taking an upstream merge and translating it into a simple recommendation: if you are testing recent Qwen-family models, update llama.cpp before you judge the model. For practitioners, that kind of community filtering is often more useful than another raw benchmark chart.
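The thread's recommendation amounts to rebuilding from the latest source and re-measuring generation throughput before judging the model. A minimal sketch, assuming a local llama.cpp checkout and its bundled `llama-bench` tool; the model path is a placeholder:

```shell
# Pull the latest llama.cpp and rebuild from source.
git pull origin master
cmake -B build -DGGML_CUDA=ON   # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j

# Re-measure token generation (tg32 = 32 generated tokens, no prompt eval),
# matching the style of the numbers quoted in the PR.
./build/bin/llama-bench -m /path/to/model.gguf -p 0 -n 32
```

Running the same `llama-bench` invocation before and after the update is the cleanest way to see whether the merge changes tg throughput on a given machine.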



