Reddit Flags a New llama.cpp Metal Speedup for Qwen 3.5 on Mac
Original: Mac users should update llama.cpp to get a big speed boost on Qwen 3.5
What changed in llama.cpp
A focused r/LocalLLaMA thread sent Mac users to llama.cpp pull request #20361, titled `metal: add GDN kernel`. The pull request was merged on March 11, 2026, and adds a fused GDN (gated delta net) recurrent kernel to the Metal backend. Most of the code changes land in the ggml-metal path, including a substantial update to `ggml-metal.metal`, which signals that this is a backend optimization rather than a small tuning patch.
The benchmark table in the PR explains why the Reddit post moved quickly. For Qwen35 27B Q8_0, the author reports 349.12 to 390.39 tokens per second on pp512 and 363.75 to 406.81 on pp2048, both roughly a 12 percent uplift. On tg32, the same model goes from 17.03 to 20.36 tokens per second, which is about 20 percent. Qwen35moe 35B.A3B Q4_0 shows even larger jumps: 1612.12 to 2058.31 on pp512, 1879.76 to 2462.35 on pp2048, and 57.08 to 77.65 on tg32, landing in the roughly 28 to 36 percent range. The PR also lists bigger gains on Kimi Linear, but the LocalLLaMA audience understandably focused on Qwen 3.5 because that is where many Mac users are actively testing local inference.
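The percentage uplifts quoted above can be sanity-checked directly from the before/after tokens-per-second figures reported in the PR table; a minimal sketch:

```python
# Before/after tokens-per-second figures as reported in llama.cpp PR #20361.
benchmarks = {
    "Qwen35 27B Q8_0 / pp512":          (349.12, 390.39),
    "Qwen35 27B Q8_0 / pp2048":         (363.75, 406.81),
    "Qwen35 27B Q8_0 / tg32":           (17.03, 20.36),
    "Qwen35moe 35B.A3B Q4_0 / pp512":   (1612.12, 2058.31),
    "Qwen35moe 35B.A3B Q4_0 / pp2048":  (1879.76, 2462.35),
    "Qwen35moe 35B.A3B Q4_0 / tg32":    (57.08, 77.65),
}

for name, (before, after) in benchmarks.items():
    # Relative uplift in percent: (after / before - 1) * 100
    uplift = (after / before - 1) * 100
    print(f"{name}: {before:.2f} -> {after:.2f} t/s (+{uplift:.1f}%)")
```

Running this confirms the rough 12 percent figure for the dense model's prompt processing and the roughly 28 to 36 percent range for the MoE variant.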
What the thread added
Reddit comments added useful operational context. One commenter clarified that the patch had already been merged to master, after earlier confusion about a side branch. Another noted that the change may appear in released binaries somewhat later than in the merged source tree. The most practical comparison came from a user testing 4-bit Qwen3.5-35B-A3B on an M1 Max with 64 GB of memory, who said MLX still outperformed GGUF in their setup. That does not negate the PR numbers so much as frame them: this is a meaningful backend improvement inside llama.cpp, not proof that every Mac inference stack is suddenly equal.
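Since released binaries can lag behind master, the quickest way to pick up the merged kernel is to build from source; a sketch, assuming a standard macOS toolchain (the Metal backend is enabled by default on Apple Silicon, and the model path below is illustrative, not from the thread):

```shell
# Fetch current master, which includes the merged GDN kernel (PR #20361)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Reproduce the PR's benchmark shape (pp512, pp2048, tg32)
# against a local GGUF file of your choice
./build/bin/llama-bench -m ./models/qwen3.5-27b-q8_0.gguf -p 512,2048 -n 32
```

`llama-bench` prints a table in the same pp/tg format the PR uses, which makes before/after comparisons across llama.cpp versions straightforward.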
Why it matters
For local-model users, this is the kind of optimization that actually changes day-to-day ergonomics. Gains of this size compound across prompt processing, token generation, and long sessions. The thread is also a reminder that the Mac inference ecosystem remains highly competitive: ggml, MLX, quantization format, and model architecture all shape the final experience. What Reddit highlighted here is simple: if you run Qwen 3.5 through llama.cpp on Apple Silicon, March 11, 2026 was a meaningful backend update worth tracking closely.
Source post: r/LocalLLaMA thread. Primary source: llama.cpp PR #20361.