Reddit Flags a New llama.cpp Metal Speedup for Qwen 3.5 on Mac

Original: Mac users should update llama.cpp to get a big speed boost on Qwen 3.5

LLM · Mar 12, 2026 · By Insights AI (Reddit)

What changed in llama.cpp

A focused r/LocalLLaMA thread sent Mac users to llama.cpp pull request #20361, titled metal: add GDN kernel. The pull request was merged on March 11, 2026, and adds a fused GDN recurrent kernel to the Metal backend. Most of the code changes land in the ggml-metal path, including a substantial update to ggml-metal.metal, which signals that this is a backend optimization rather than a small tuning patch.

The benchmark table in the PR explains why the Reddit post moved quickly. For Qwen3.5 27B Q8_0, the author reports 349.12 to 390.39 tokens per second on pp512 and 363.75 to 406.81 on pp2048, both roughly a 12 percent uplift. On tg32, the same model goes from 17.03 to 20.36 tokens per second, about a 20 percent gain. Qwen3.5-MoE 35B-A3B Q4_0 shows even larger jumps: 1612.12 to 2058.31 on pp512, 1879.76 to 2462.35 on pp2048, and 57.08 to 77.65 on tg32, landing in the roughly 28 to 36 percent range. The PR also lists bigger gains on Kimi Linear, but the LocalLLaMA audience understandably focused on Qwen 3.5 because that is where many Mac users are actively testing local inference.
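For readers who want to check the arithmetic, here is a small Python sketch that recomputes the percent uplifts from the before/after tokens-per-second figures quoted above. The numbers are copied from the article; the labels are shorthand, not exact rows from the PR table.

```python
def uplift(before: float, after: float) -> float:
    """Percent speedup going from `before` to `after` tokens/second."""
    return (after / before - 1.0) * 100.0

# (before, after) tok/s pairs as reported in the PR description.
benchmarks = {
    "Qwen3.5 27B Q8_0 pp512":          (349.12, 390.39),
    "Qwen3.5 27B Q8_0 pp2048":         (363.75, 406.81),
    "Qwen3.5 27B Q8_0 tg32":           (17.03, 20.36),
    "Qwen3.5-MoE 35B-A3B Q4_0 pp512":  (1612.12, 2058.31),
    "Qwen3.5-MoE 35B-A3B Q4_0 pp2048": (1879.76, 2462.35),
    "Qwen3.5-MoE 35B-A3B Q4_0 tg32":   (57.08, 77.65),
}

for name, (before, after) in benchmarks.items():
    print(f"{name}: {uplift(before, after):+.1f}%")
```

Running this confirms the rough figures in the text: about 12 percent on the dense model's prompt processing, about 20 percent on its token generation, and 28 to 36 percent across the MoE rows.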

What the thread added

Reddit comments added useful operational context. One commenter clarified that the patch had already been merged to master after earlier confusion about a side branch. Another noted that the change might appear in released binaries a bit later than in the merged source tree. The most practical comparison came from a user testing 4-bit Qwen3.5-35B-A3B on an M1 Max with 64 GB of memory, who said MLX still outperformed GGUF in their setup. That does not negate the PR numbers. It frames them correctly. This is a meaningful backend improvement inside llama.cpp, not proof that every Mac inference stack is suddenly equal.

Why it matters

For local-model users, this is the kind of optimization that actually changes day-to-day ergonomics. Small percentage gains matter when they compound across prompt processing, token generation, and long sessions. The thread is also a reminder that the Mac inference ecosystem remains highly competitive. ggml, MLX, quantization format, and model architecture all shape the final experience. What Reddit highlighted here is simple: if you run Qwen 3.5 through llama.cpp on Apple Silicon, March 11, 2026 was a meaningful backend update worth tracking closely.
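For anyone who wants to try the merged change before release binaries catch up, here is a minimal sketch of building llama.cpp from master on a Mac and running `llama-bench` with pp512/pp2048/tg32-style settings. The model path is a placeholder for a local Qwen 3.5 GGUF file, and the steps assume the standard llama.cpp CMake workflow, where the Metal backend is enabled by default on macOS.

```shell
# Grab the latest source; the merged GDN kernel lands in master
# before it appears in release binaries.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Standard CMake build; Metal is on by default on Apple Silicon.
cmake -B build
cmake --build build --config Release -j

# Reproduce pp512/pp2048/tg32-style numbers with llama-bench.
# Replace the .gguf path with your local Qwen 3.5 quantization.
./build/bin/llama-bench \
  -m /path/to/qwen3.5-Q8_0.gguf \
  -p 512,2048 \
  -n 32
```

Comparing the `llama-bench` output before and after the update is the most direct way to see whether your machine lands in the same 12 to 36 percent range the PR reports.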

Source post: r/LocalLLaMA thread. Primary source: llama.cpp PR #20361.



