A LocalLLaMA Benchmark Suggests MoE Models Fit 32 GB Apple Laptops Well

Original post: "I benchmarked 37 LLMs on MacBook Air M5 32GB — full results + open-source tool to benchmark your own Mac" (r/LocalLLaMA)

LLM · Apr 7, 2026 · By Insights AI (Reddit) · 2 min read

A recent LocalLLaMA discussion shared results from Mac LLM Bench, an open repository that tries to make Apple Silicon local-LLM performance easier to compare. The author benchmarked 37 models across 10 families on a 32 GB MacBook Air M5 using llama-bench with Q4_K_M quantization and published both the numbers and the scripts behind them.
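For readers who want to reproduce this kind of run, the setup described above can be sketched as a llama-bench invocation. This is an illustrative sketch, not the repository's exact script: the model path is a placeholder, and the flags follow llama.cpp's llama-bench conventions (`-p` for fixed prompt-processing lengths, `-n` for fixed generation lengths, which produce the pp/tg metrics mentioned below).

```python
# Sketch: assemble a llama-bench command line matching the post's
# setup (Q4_K_M GGUF, fixed-token pp/tg runs). The model path is a
# hypothetical placeholder, not a file from the repository.
model = "models/qwen3-0.6b-q4_k_m.gguf"  # placeholder path

cmd = [
    "llama-bench",
    "-m", model,
    "-p", "128,256,512",  # prompt-processing runs: pp128 / pp256 / pp512
    "-n", "128,256",      # token-generation runs: tg128 / tg256
]
print(" ".join(cmd))
```

Running the printed command requires a local llama.cpp build and a downloaded GGUF model.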

The headline finding is not that one model wins universally, but that mixture-of-experts models appear to be a particularly strong fit for 32 GB laptops. In the posted results, Qwen 3.5 35B-A3B MoE reached 31.3 tokens per second on tg128 while using about 20.7 GB of RAM, whereas dense 32B-class models clustered near 2.5 tokens per second with roughly 18.6 to 18.7 GB of memory use. Smaller models naturally ran much faster, with Qwen 3 0.6B at 91.9 tok/s and Llama 3.2 1B at 59.4 tok/s, but the interesting comparison is the balance between interactivity and capability in the mid-to-large range.
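The tradeoff becomes clearer when the quoted numbers are normalized by memory footprint. A minimal calculation using only the figures cited above (no new measurements):

```python
# Throughput-per-GB comparison using the tg128 numbers quoted above.
# These are the post's reported values, not independent measurements.
results = {
    "Qwen 3.5 35B-A3B (MoE)": (31.3, 20.7),  # (tok/s, GB RAM)
    "dense 32B-class":        (2.5, 18.6),
}
for name, (tps, gb) in results.items():
    print(f"{name}: {tps / gb:.2f} tok/s per GB")
```

At similar memory footprints, the MoE model delivers roughly an order of magnitude more generation throughput per gigabyte in this result set.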

The repository is built to be reproducible rather than anecdotal. It supports both GGUF benchmarks through llama.cpp and optional MLX benchmarks through mlx_lm.benchmark, stores fixed-token metrics such as pp128, pp256, pp512, tg128, and tg256, and organizes results by chip generation and hardware configuration. At the time of the post, the M5 section included 41 benchmarks when GGUF and MLX runs were combined.
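A results database like the one described could be modeled as a mapping from hardware and model to the fixed-token metrics. The sketch below is hypothetical; the field names and layout are illustrative assumptions, not the repository's actual schema:

```python
# Hypothetical schema: results keyed by (chip config, model), with one
# tok/s value per fixed-token metric. Names are illustrative only.
from collections import defaultdict

db = defaultdict(dict)  # (chip, model) -> {metric: tok/s}
db[("M5 32GB", "Qwen 3 0.6B Q4_K_M")]["tg128"] = 91.9   # from the post
db[("M5 32GB", "Llama 3.2 1B Q4_K_M")]["tg128"] = 59.4  # from the post

def best(db, metric):
    """Return the (chip, model) entry with the highest tok/s for a metric."""
    rows = [(key, vals[metric]) for key, vals in db.items() if metric in vals]
    return max(rows, key=lambda row: row[1])

print(best(db, "tg128"))
```

Keying on chip configuration is what lets the project aggregate M1-through-M5 runs into one comparable table.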

What developers should take from it

The most useful point in the LocalLLaMA post is practical: a 32 GB Apple laptop has a clear wall for dense 32B models, and MoE designs can sometimes deliver a better latency-to-capability tradeoff. That does not make the published numbers universal, because runtime choice, quantization, thermal conditions, and prompt shape all matter. But it does provide a community-maintained starting point for hardware planning.

  • Focus machine in this result set: MacBook Air M5 with 32 GB RAM.
  • Primary benchmark tool: llama-bench, with separate support for MLX runs.
  • Project goal: a cross-generation benchmark database for M1 through M5 systems.
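The 32 GB "wall" for dense 32B models can be sanity-checked with a back-of-envelope estimate. Assuming Q4_K_M averages roughly 4.8 to 4.9 bits per weight (an approximation, since the mixed quantization varies by tensor), weight storage alone lands near the memory figures quoted above, before KV cache and runtime overhead:

```python
# Rough weight-memory estimate for Q4_K_M models. The bits-per-weight
# value is an assumed average for this mixed quantization scheme.
def q4_k_m_weight_gb(params_billions, bits_per_weight=4.85):
    """Approximate GB of weight storage for a Q4_K_M-quantized model."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"dense 32B:          ~{q4_k_m_weight_gb(32):.1f} GB of weights")
print(f"35B-total-param MoE: ~{q4_k_m_weight_gb(35):.1f} GB of weights")
```

Both estimates sit close to the roughly 18.6 and 20.7 GB figures in the posted results, which is why 32 GB machines can hold these models but leave little headroom for long contexts.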

For local-LLM users, the value is not just one leaderboard screenshot. It is the emergence of a repeatable, open benchmark workflow that others can extend with their own machines and model choices.



© 2026 Insights. All rights reserved.