LocalLLaMA Gets a MacBook Air M5 Benchmark for 21 Coding Models, Not Just Vibes

Original: I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

LLM · Apr 23, 2026 · By Insights AI (Reddit) · 2 min read

A r/LocalLLaMA benchmark post hit the community's sweet spot: models tested on real consumer hardware, with code quality, speed, and memory shown together. The author said the goal was to replace “trust me” coding-model recommendations with a direct comparison, using HumanEval+ pass@1 across 164 coding problems plus token speed and VRAM footprint on a MacBook Air M5.
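The scoring itself is straightforward: with one sample per problem, HumanEval+ pass@1 is just the fraction of the 164 problems whose generated solution passes the extended test suite. A minimal sketch of that calculation (the results vector below is hypothetical, standing in for the output of EvalPlus's actual test harness):

```python
def pass_at_1(passed_flags):
    """pass@1 with a single sample per problem: fraction of problems solved."""
    return sum(passed_flags) / len(passed_flags)

# Hypothetical results vector: 147 of 164 problems passing reproduces
# the ~89.6% headline figure from the post.
flags = [True] * 147 + [False] * 17
print(f"pass@1 = {pass_at_1(flags):.1%}")  # pass@1 = 89.6%
```

With more samples per problem, the general pass@k estimator applies, but a single-sample run like the author's reduces to this simple average.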

The headline result is Qwen 3.6 35B-A3B. In the author's table, the MoE model scored 89.6% on HumanEval+, generated at 16.9 tok/s, and used 20.1 GB. Qwen 2.5 Coder 32B followed closely on quality at 87.2%, but ran at only 2.5 tok/s. Qwen 2.5 Coder 7B looked like the practical budget pick: 84.2% at 11.3 tok/s in 4.5 GB. For a local daily coding assistant, that speed and memory profile can matter as much as the top score.
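To see why throughput matters as much as the score, convert the table's tok/s figures into wall-clock time for a typical answer. A back-of-the-envelope sketch using the post's numbers (the 500-token completion length is an assumption, not from the post):

```python
# tok/s figures from the author's table; completion length is assumed.
speeds_tps = {
    "Qwen 3.6 35B-A3B": 16.9,
    "Qwen 2.5 Coder 32B": 2.5,
    "Qwen 2.5 Coder 7B": 11.3,
}
completion_tokens = 500  # assumed length of a typical coding answer

for model, tps in speeds_tps.items():
    print(f"{model}: ~{completion_tokens / tps:.0f} s per answer")
```

At 2.5 tok/s, the dense 32B model needs over three minutes per 500-token answer; the MoE model finishes in about half a minute for a 2.4-point quality gap.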

The most debated section was the poor Gemma 4 results. Gemma 4 31B came in at 31.1%, Gemma 4 E4B at 14.6%, and the Gemma 4 26B-A4B MoE at 12.2%. The author suggested that Q4_K_M quantization may hurt Gemma 4 more than other architectures, or that HumanEval+ may not reflect its strengths. Commenters added a practical theory: recent Gemma 4 tool-calling issues can cause premature stops near tool calls, and fixes in Google's and llama.cpp's code paths may not fully remove the problem.

That back-and-forth is what makes the post useful. It does not turn one benchmark into a universal ranking. It shows that local model choice is a hardware and workload question. A model can be impressive on a hosted leaderboard and still be wrong for a laptop. Another model can be less fashionable and still win because it fits in memory, generates fast enough, and passes the user's test suite.

The original discussion is on Reddit, with the author's Medium writeup, GitHub repo, and Hugging Face dataset linked from the post. The community reaction is a reminder that local LLM users increasingly want reproducible numbers, not broad claims about which model “feels” better.


© 2026 Insights. All rights reserved.