LocalLLaMA Gets a MacBook Air M5 Benchmark for 21 Coding Models, Not Just Vibes

An r/LocalLLaMA benchmark post hit the community's sweet spot: models tested on real consumer hardware, with code quality, speed, and memory shown together. The author said the goal was to replace “trust me” coding-model recommendations with a direct comparison of 21 models: HumanEval+ pass@1 across 164 coding problems, plus generation speed in tokens per second and memory footprint, all measured on a MacBook Air M5.
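For readers unfamiliar with the metric, here is a minimal Python sketch of how a pass@1 score can be computed when each problem gets exactly one generated solution. That one-sample setup is an assumption, not something the post confirms, and the function name `pass_at_1` and the outcome list are hypothetical illustrations rather than the author's harness or raw data.

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose single generated solution passed every test.

    Assumes one sample per problem, so pass@1 reduces to a plain pass rate.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


# Hypothetical outcome list: 147 passes out of 164 problems.
# Illustrative only; this is not the author's per-problem data.
example = [True] * 147 + [False] * 17

print(f"{pass_at_1(example):.1%}")  # prints 89.6%
```

With only 164 problems, each additional solved problem moves the score by roughly 0.6 percentage points, which is worth keeping in mind when two models sit close together.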

The headline result is Qwen 3.6 35B-A3B. In the author's table, the MoE model scored 89.6% on HumanEval+, generated at 16.9 tok/s, and used 20.1 GB. Qwen 2.5 Coder 32B followed closely on quality at 87.2%, but ran at only 2.5 tok/s. Qwen 2.5 Coder 7B looked like the practical budget pick: 84.2% at 11.3 tok/s in 4.5 GB. For a local daily coding assistant, that speed and memory profile can matter as much as the top score.
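To make the speed gap concrete, the sketch below converts the reported tokens-per-second figures into rough wait times for a single answer. The 400-token response length is a hypothetical value chosen for illustration, and the estimate ignores prompt processing time, so treat it as a lower bound on what a user would actually wait.

```python
# Generation speeds as reported in the author's table (tokens per second).
REPORTED_TOK_PER_S = {
    "Qwen 3.6 35B-A3B": 16.9,
    "Qwen 2.5 Coder 32B": 2.5,
    "Qwen 2.5 Coder 7B": 11.3,
}

# Hypothetical response length, not a number from the post.
RESPONSE_TOKENS = 400


def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Estimated time to generate `tokens` tokens, excluding prompt processing."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return tokens / tok_per_s


for name, speed in REPORTED_TOK_PER_S.items():
    print(f"{name}: ~{generation_seconds(RESPONSE_TOKENS, speed):.0f} s")
```

Under these assumptions the 32B model takes 160 seconds for an answer that the 35B-A3B MoE model produces in about 24, which is why a 2.4-point quality gap does not settle the choice by itself.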

The most debated section was the poor Gemma 4 showing. Gemma 4 31B came in at 31.1%, Gemma 4 E4B at 14.6%, and Gemma 4 26B-A4B MoE at 12.2%. The author said the Q4_K_M quantization may be hurting Gemma 4 more than other architectures, or HumanEval+ may not reflect its strengths. Commenters added a more practical theory: recent Gemma 4 tool-calling issues can make generation stop prematurely near tool calls, and fixes on the Google and llama.cpp sides may not have fully removed the problem.

That back-and-forth is what makes the post useful. It does not turn one benchmark into a universal ranking. It shows that local model choice is a hardware and workload question. A model can be impressive on a hosted leaderboard and still be wrong for a laptop. Another model can be less fashionable and still win because it fits in memory, generates fast enough, and passes the user's test suite.
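The "hardware and workload" framing can be expressed as a simple filter: drop models that do not fit a memory budget or a minimum speed, then take the best score among the rest. The sketch below uses only the two models for which the article reports all three numbers. The budgets, the speed floor, and the names `ModelResult` and `pick_model` are hypothetical and are not taken from the author's repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelResult:
    name: str
    pass_at_1: float   # HumanEval+ pass@1, percent
    tok_per_s: float   # generation speed
    memory_gb: float   # reported memory footprint


# Only rows with all three figures quoted in the article.
RESULTS = [
    ModelResult("Qwen 3.6 35B-A3B", 89.6, 16.9, 20.1),
    ModelResult("Qwen 2.5 Coder 7B", 84.2, 11.3, 4.5),
]


def pick_model(
    results: list[ModelResult],
    memory_budget_gb: float,
    min_tok_per_s: float,
) -> Optional[ModelResult]:
    """Return the highest-scoring model that fits both constraints, or None."""
    candidates = [
        r for r in results
        if r.memory_gb <= memory_budget_gb and r.tok_per_s >= min_tok_per_s
    ]
    return max(candidates, key=lambda r: r.pass_at_1, default=None)


# Hypothetical budgets: a tighter machine versus one with more headroom.
for budget in (16.0, 24.0):
    choice = pick_model(RESULTS, memory_budget_gb=budget, min_tok_per_s=10.0)
    print(budget, choice.name if choice else "no model fits")
```

With a 16 GB budget the filter lands on Qwen 2.5 Coder 7B; at 24 GB it picks Qwen 3.6 35B-A3B. A fuller version would replace the pass@1 column with results from the user's own test suite.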

The original discussion is on Reddit, with the author's Medium writeup, GitHub repo, and Hugging Face dataset linked from the post. The community reaction is a reminder that local LLM users increasingly want reproducible numbers, not broad claims about which model “feels” better.
