LocalLLaMA Gets a MacBook Air M5 Benchmark for 21 Coding Models, Not Just Vibes
Original: I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed
A r/LocalLLaMA benchmark post hit the community's sweet spot: models tested on real consumer hardware, with code quality, speed, and memory shown together. The author said the goal was to replace “trust me” coding-model recommendations with a direct comparison, using HumanEval+ pass@1 across 164 coding problems plus token speed and memory footprint on a MacBook Air M5.
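For readers who have not run this benchmark themselves, pass@1 with a single sample is simply the fraction of problems whose one generated solution passes the problem's tests. The sketch below shows that loop in minimal form; it assumes a `generate(prompt)` wrapper around a local model and a list of HumanEval+-style problems with a "prompt" and a "check" (test code) field, and it is an illustration of the metric rather than the author's actual harness.

```python
# Minimal sketch of a HumanEval+-style pass@1 loop (illustrative, not the
# author's harness). Assumes `generate(prompt)` wraps a local model and
# `problems` is a list of dicts with "prompt" and "check" (test code).
import subprocess
import tempfile

def passes(candidate_code: str, check_code: str, timeout: int = 10) -> bool:
    """Run the generated solution plus the problem's tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + check_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def pass_at_1(problems, generate) -> float:
    """With one sample per problem, pass@1 is just the fraction of problems
    whose single generated solution passes all tests."""
    solved = sum(passes(generate(p["prompt"]), p["check"]) for p in problems)
    return solved / len(problems)
```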
The headline result is Qwen 3.6 35B-A3B. In the author's table, the MoE model scored 89.6% on HumanEval+, generated at 16.9 tok/s, and used 20.1 GB. Qwen 2.5 Coder 32B followed closely on quality at 87.2%, but ran at only 2.5 tok/s. Qwen 2.5 Coder 7B looked like the practical budget pick: 84.2% at 11.3 tok/s in 4.5 GB. For a local daily coding assistant, that speed and memory profile can matter as much as the top score.
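The speed side of that table is easy to sanity-check on your own machine. Here is a minimal sketch, assuming llama-cpp-python and a local GGUF file; the model path and prompt are placeholders, not the author's setup, and a single prompt will only give a rough number.

```python
# Rough throughput check with llama-cpp-python (illustrative; the author's
# figures come from their own harness). Model path and prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/qwen2.5-coder-7b-q4_k_m.gguf",
            n_ctx=4096, n_gpu_layers=-1)

prompt = "Write a Python function that checks whether a string is a palindrome."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```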
The most debated section was the poor Gemma 4 result. Gemma 4 31B came in at 31.1%, Gemma 4 E4B at 14.6%, and Gemma 4 26B-A4B MoE at 12.2%. The author said Q4_K_M quantization may hurt Gemma 4 more than it hurts other architectures, or HumanEval+ may simply not reflect its strengths. Comments added another practical theory: a recent Gemma 4 tool-calling bug can make generations stop prematurely around tool calls, and fixes in Google's and llama.cpp's code paths may not fully remove the issue.
That back-and-forth is what makes the post useful. It does not turn one benchmark into a universal ranking. It shows that local model choice is a hardware and workload question. A model can be impressive on a hosted leaderboard and still be wrong for a laptop. Another model can be less fashionable and still win because it fits in memory, generates fast enough, and passes the user's test suite.
The original discussion is on Reddit, with the author's Medium writeup, GitHub repo, and Hugging Face dataset linked from the post. The community reaction is a reminder that local LLM users increasingly want reproducible numbers, not broad claims about which model “feels” better.
Related Articles
LocalLLaMA upvoted this because it turns a messy GGUF choice into a measurable tradeoff. The post compares community Qwen3.5-9B quants against a BF16 baseline using mean KLD, then the comments push for better visual encoding, Gemma 4 runs, Thireus quants, and long-context testing.
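The mean-KLD part of that comparison is easy to picture: average the KL divergence between the BF16 model's next-token distribution and the quant's, token by token, over the same evaluation text. A minimal sketch of the calculation, assuming you already have aligned logits from both runs (llama.cpp's perplexity tooling has a KL-divergence mode for doing this at scale, so this is only to show what the number means):

```python
# Sketch of the mean-KLD idea: average KL(P_bf16 || Q_quant) over the same
# evaluation tokens. Assumes aligned logits from both runs are already saved.
import numpy as np

def mean_kld(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """ref_logits, quant_logits: [num_tokens, vocab_size] arrays of raw logits."""
    ref = ref_logits - ref_logits.max(axis=-1, keepdims=True)
    qnt = quant_logits - quant_logits.max(axis=-1, keepdims=True)
    log_p = ref - np.log(np.exp(ref).sum(axis=-1, keepdims=True))  # log-softmax, BF16
    log_q = qnt - np.log(np.exp(qnt).sum(axis=-1, keepdims=True))  # log-softmax, quant
    kld = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)           # KL per token
    return float(kld.mean())
```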
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22 GB VRAM budget.
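The policy behind that fork is essentially an LRU cache keyed by expert ID: experts the router has touched recently stay in VRAM, cold ones live in system RAM and are copied in on demand. A toy sketch of that policy in Python, not the fork's actual code:

```python
# Toy sketch of the hot-expert idea: keep the most recently routed experts in
# fast memory (an LRU set), pull cold experts from CPU RAM on demand.
# Illustration of the policy only, not the llama.cpp fork's implementation.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, capacity: int, load_expert, unload_expert):
        self.capacity = capacity          # how many experts fit in VRAM
        self.load_expert = load_expert    # callback: copy expert weights to GPU
        self.unload_expert = unload_expert
        self.hot = OrderedDict()          # expert_id -> GPU handle, LRU order

    def get(self, expert_id):
        if expert_id in self.hot:
            self.hot.move_to_end(expert_id)                    # mark recently used
            return self.hot[expert_id]
        if len(self.hot) >= self.capacity:
            evicted_id, handle = self.hot.popitem(last=False)  # evict coldest
            self.unload_expert(evicted_id, handle)
        self.hot[expert_id] = self.load_expert(expert_id)
        return self.hot[expert_id]
```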
LocalLLaMA reacted because the joke-like idea of an LLM tuning its own runtime came with concrete benchmark numbers. The author says llm-server v2 adds --ai-tune, feeding llama-server's help output into a tuning loop that searches flag combinations and caches the fastest config; on their rig, Qwen3.5-27B Q4_K_M moved from 18.5 tok/s to 40.05 tok/s.
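The tuning loop itself is less magic than it sounds: enumerate candidate runtime settings, time a fixed generation under each, and cache the winner. A rough sketch of that search-and-cache pattern using llama-cpp-python parameters; this is an illustration of the idea, not the llm-server tool from the post, and the model path is a placeholder.

```python
# Sketch of the flag-search idea: try a small grid of runtime settings, measure
# generation speed for each, and cache the winner. Illustrative only.
import itertools, json, time
from llama_cpp import Llama

MODEL = "models/example-q4_k_m.gguf"   # placeholder path
grid = {
    "n_threads": [4, 8],
    "n_batch": [256, 512],
    "n_gpu_layers": [0, -1],
}

best = None
for combo in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    llm = Llama(model_path=MODEL, n_ctx=2048, verbose=False, **params)
    start = time.perf_counter()
    out = llm("Explain what a mutex is.", max_tokens=128)
    tps = out["usage"]["completion_tokens"] / (time.perf_counter() - start)
    if best is None or tps > best[0]:
        best = (tps, params)
    del llm                             # free the model before the next config

with open("tuned_config.json", "w") as f:   # cache the fastest configuration
    json.dump({"tok_per_s": best[0], "params": best[1]}, f, indent=2)
```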