LocalLLaMA Gets a MacBook Air M5 Benchmark for 21 Coding Models, Not Just Vibes
Original: I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed
An r/LocalLLaMA benchmark post hit the community's sweet spot: models tested on real consumer hardware, with code quality, speed, and memory shown together. The author said the goal was to replace “trust me” coding-model recommendations with a direct comparison, using HumanEval+ pass@1 across 164 coding problems plus generation speed in tokens per second and memory footprint on a MacBook Air M5.
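For readers unfamiliar with the metric, here is a minimal sketch of how a pass@1 score can be computed when each problem gets exactly one generated solution. That single-sample setup is an assumption, not something the post summary confirms. The function name and the pass/fail list are hypothetical, and 147 of 164 is simply a count that rounds to 89.6%, not a figure taken from the author's data.

```python
def pass_at_1(results):
    """Fraction of problems solved, given one pass/fail outcome per problem."""
    if not results:
        raise ValueError("need at least one result")
    return sum(1 for passed in results if passed) / len(results)


# Hypothetical outcome list: 147 passing solutions out of 164 problems.
results = [True] * 147 + [False] * 17
score = pass_at_1(results)
print(f"{score:.1%}")  # prints 89.6%
```

With 164 problems, each solved or failed problem moves the score by roughly 0.6 percentage points, so small gaps between models amount to only a handful of problems.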
The headline result is Qwen 3.6 35B-A3B. In the author's table, the MoE model scored 89.6% on HumanEval+, generated at 16.9 tok/s, and used 20.1 GB. Qwen 2.5 Coder 32B followed closely on quality at 87.2%, but ran at only 2.5 tok/s. Qwen 2.5 Coder 7B looked like the practical budget pick: 84.2% at 11.3 tok/s in 4.5 GB. For a local daily coding assistant, that speed and memory profile can matter as much as the top score.
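The trade-off between score, speed, and memory can be sketched as a simple filter. The example below uses only the figures quoted above. The memory budgets (24 GB and 8 GB) and the 10 tok/s floor are hypothetical thresholds chosen for illustration, and Qwen 2.5 Coder 32B has no memory figure in this summary, so it is left as unknown and excluded from budgeted picks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelResult:
    name: str
    pass_at_1: float            # HumanEval+ pass@1, in percent
    tok_per_s: float            # generation speed
    memory_gb: Optional[float]  # None where this summary gives no figure


# Figures quoted above from the author's table.
RESULTS = [
    ModelResult("Qwen 3.6 35B-A3B", 89.6, 16.9, 20.1),
    ModelResult("Qwen 2.5 Coder 32B", 87.2, 2.5, None),
    ModelResult("Qwen 2.5 Coder 7B", 84.2, 11.3, 4.5),
]


def best_fit(results, memory_budget_gb, min_tok_per_s):
    """Return the highest-scoring model that meets both limits, or None."""
    candidates = [
        r for r in results
        if r.memory_gb is not None
        and r.memory_gb <= memory_budget_gb
        and r.tok_per_s >= min_tok_per_s
    ]
    return max(candidates, key=lambda r: r.pass_at_1, default=None)


def seconds_for(tokens, tok_per_s):
    """Rough generation time in seconds, ignoring prompt processing."""
    return tokens / tok_per_s


# Hypothetical budgets: a roomy 24 GB and a tight 8 GB, both at >= 10 tok/s.
roomy = best_fit(RESULTS, memory_budget_gb=24.0, min_tok_per_s=10.0)
tight = best_fit(RESULTS, memory_budget_gb=8.0, min_tok_per_s=10.0)
print(roomy.name)  # Qwen 3.6 35B-A3B
print(tight.name)  # Qwen 2.5 Coder 7B
print(seconds_for(500, 2.5))  # 200.0 seconds for a 500-token answer
```

The last line shows why the 2.5 tok/s result matters in practice: at that rate a 500-token completion takes over three minutes, whatever the model's score.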
The most debated section was the poor Gemma 4 result. Gemma 4 31B came in at 31.1%, Gemma 4 E4B at 14.6%, and Gemma 4 26B-A4B MoE at 12.2%. The author said the Q4_K_M quantization may be hurting Gemma 4 more than other architectures, or HumanEval+ may not reflect its strengths. Comments added another practical theory: recent Gemma 4 tool-calling issues can cause premature stops near tool calls, and fixes in Google and llama.cpp code paths may not fully remove the issue.
That back-and-forth is what makes the post useful. It does not turn one benchmark into a universal ranking. It shows that local model choice is a hardware and workload question. A model can be impressive on a hosted leaderboard and still be wrong for a laptop. Another model can be less fashionable and still win because it fits in memory, generates fast enough, and passes the user's test suite.
The original discussion is on Reddit, with the author's Medium writeup, GitHub repo, and Hugging Face dataset linked from the post. The community reaction is a reminder that local LLM users increasingly want reproducible numbers, not broad claims about which model “feels” better.