Reddit Flags a New llama.cpp Metal Speedup for Qwen 3.5 on Mac
Original: Mac users should update llama.cpp to get a big speed boost on Qwen 3.5
What changed in llama.cpp
A focused r/LocalLLaMA thread sent Mac users to llama.cpp pull request #20361, titled `metal: add GDN kernel`. The pull request was merged on March 11, 2026, and adds a fused GDN (gated delta net) recurrent kernel to the Metal backend. Most of the code changes land in the ggml-metal path, including a substantial update to `ggml-metal.metal`, which signals that this is a backend optimization rather than a small tuning patch.
The benchmark table in the PR explains why the Reddit post moved quickly. For Qwen35 27B Q8_0, the author reports 349.12 to 390.39 tokens per second on pp512 and 363.75 to 406.81 on pp2048, both roughly a 12 percent uplift. On tg32, the same model goes from 17.03 to 20.36 tokens per second, which is about 20 percent. Qwen35moe 35B.A3B Q4_0 shows even larger jumps: 1612.12 to 2058.31 on pp512, 1879.76 to 2462.35 on pp2048, and 57.08 to 77.65 on tg32, landing in the roughly 28 to 36 percent range. The PR also lists bigger gains on Kimi Linear, but the LocalLLaMA audience understandably focused on Qwen 3.5 because that is where many Mac users are actively testing local inference.
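The percentage uplifts quoted above can be sanity-checked directly from the before/after tokens-per-second figures reported in the PR table; a minimal sketch:

```python
# Before/after tokens-per-second figures as reported in llama.cpp PR #20361.
benchmarks = {
    "Qwen35 27B Q8_0 / pp512":          (349.12, 390.39),
    "Qwen35 27B Q8_0 / pp2048":         (363.75, 406.81),
    "Qwen35 27B Q8_0 / tg32":           (17.03, 20.36),
    "Qwen35moe 35B.A3B Q4_0 / pp512":   (1612.12, 2058.31),
    "Qwen35moe 35B.A3B Q4_0 / pp2048":  (1879.76, 2462.35),
    "Qwen35moe 35B.A3B Q4_0 / tg32":    (57.08, 77.65),
}

for name, (before, after) in benchmarks.items():
    # Relative uplift in percent: (after / before - 1) * 100
    uplift = (after / before - 1) * 100
    print(f"{name}: {before:.2f} -> {after:.2f} t/s (+{uplift:.1f}%)")
```

Running this confirms the rough 12 percent figure for the dense model's prompt processing and the roughly 28 to 36 percent range for the MoE variant.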
What the thread added
Reddit comments added useful operational context. One commenter clarified that the patch had already been merged to master, after earlier confusion about a side branch. Another noted that the change may appear in released binaries somewhat later than in the merged source tree. The most practical comparison came from a user testing 4-bit Qwen3.5-35B-A3B on an M1 Max with 64 GB of memory, who said MLX still outperformed GGUF in their setup. That does not negate the PR numbers so much as frame them: this is a meaningful backend improvement inside llama.cpp, not proof that every Mac inference stack is suddenly equal.
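Since released binaries can lag behind master, the quickest way to pick up the merged kernel is to build from source; a sketch, assuming a standard macOS toolchain (the Metal backend is enabled by default on Apple Silicon, and the model path below is illustrative, not from the thread):

```shell
# Fetch current master, which includes the merged GDN kernel (PR #20361)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Reproduce the PR's benchmark shape (pp512, pp2048, tg32)
# against a local GGUF file of your choice
./build/bin/llama-bench -m ./models/qwen3.5-27b-q8_0.gguf -p 512,2048 -n 32
```

`llama-bench` prints a table in the same pp/tg format the PR uses, which makes before/after comparisons across llama.cpp versions straightforward.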
Why it matters
For local-model users, this is the kind of optimization that actually changes day-to-day ergonomics. Gains of this size compound across prompt processing, token generation, and long sessions. The thread is also a reminder that the Mac inference ecosystem remains highly competitive: ggml, MLX, quantization format, and model architecture all shape the final experience. What Reddit highlighted here is simple: if you run Qwen 3.5 through llama.cpp on Apple Silicon, March 11, 2026 was a meaningful backend update worth tracking closely.
Source post: r/LocalLLaMA thread. Primary source: llama.cpp PR #20361.