r/LocalLLaMA benchmark argues M5 Max shines most on MoE prompt processing

Original: [Round 2 - Followup] M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. (thank you for the feedback)

LLM · Mar 23, 2026 · By Insights AI (Reddit) · 2 min read

A Reddit post in r/LocalLLaMA shared a second-pass benchmark run for Apple's M5 Max 128GB system, drawing 104 points and 46 comments; the thread was posted on 2026-03-22 at 13:04 UTC. The author describes it as a rerun after community feedback on an earlier post, with an updated methodology and llama-bench used for more standardized measurements. That framing matters: these are community benchmarks, not vendor or lab results.

The hardware setup is described in unusual detail. The post lists an Apple M5 Max with an 18-core CPU, a 40-core Metal GPU, 128GB of unified memory, and 614 GB/s of memory bandwidth, running macOS 26.3.1 with llama.cpp v8420 and MLX v0.31.1. The author's main claim is that the chip's most visible improvement shows up in prompt processing rather than in token generation alone. On that measure, the post reports Qwen 3.5 35B-A3B MoE at 2,845 tok/s PP512 and 2,063 tok/s PP8192, while Qwen 3.5 122B-A10B MoE is listed at 1,011 tok/s PP512 and 749 tok/s PP8192.
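The PP512/PP8192 and token-generation figures correspond to standard llama-bench test columns (pp512, pp8192, tg128). A minimal invocation sketch of how such a run is typically set up; the model path and repetition count are illustrative assumptions, not values taken from the post:

```shell
# Hypothetical reproduction sketch; the GGUF path and -r value are assumptions.
# -p 512,8192 -> prompt-processing tests (pp512, pp8192)
# -n 128      -> token-generation test (tg128)
# -ngl 99     -> full GPU (Metal) offload, as the author reports
# -fa 1       -> flash attention enabled
llama-bench -m models/qwen3.5-35b-a3b-q4_k_m.gguf \
    -p 512,8192 -n 128 -ngl 99 -fa 1 -r 5
```

llama-bench reports each test's mean tok/s with a standard deviation over the repetitions, which is what makes runs like this easier to compare than one-off screenshots.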

The token generation numbers are also notable. The same post reports 92.2 tok/s for Qwen 3.5 35B-A3B MoE, 41.5 tok/s for Qwen 3.5 122B-A10B MoE, 24.3 tok/s for Qwen 3.5 27B Q4_K_M in llama.cpp, and 31.6 tok/s for MLX 4-bit Qwen 3.5 27B. One of the more useful parts of the thread is a correction: the author says an earlier v1 claim that MLX was 92% faster than llama.cpp was unfair because it compared different quantization levels. In the revised write-up, that edge is narrowed to roughly 30% at equivalent 4-bit quantization.
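The correction is easy to sanity-check from the reported numbers themselves. A small sketch using only the figures quoted above (the superseded "92%" claim came from mismatched quantization levels, so only the matched 4-bit pair is compared here):

```python
# Reported token-generation rates for Qwen 3.5 27B on the M5 Max (tok/s),
# as quoted in the post's revised write-up.
llama_cpp_q4km = 24.3   # llama.cpp, Q4_K_M quantization
mlx_4bit = 31.6         # MLX, 4-bit quantization

# Like-for-like speedup of MLX over llama.cpp at ~4-bit on both sides.
speedup = mlx_4bit / llama_cpp_q4km - 1.0
print(f"MLX edge at matched 4-bit quantization: {speedup:.0%}")  # ~30%
```

That works out to roughly 30%, matching the author's revised figure and illustrating why comparing a 4-bit MLX run against a different llama.cpp quantization inflated the earlier claim.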

The post's broader thesis is that Mixture-of-Experts models benefit disproportionately from Apple's unified memory design because only the weights of the active experts need to be read per token. That is how the author explains why the 35B-A3B MoE result looks so strong relative to dense 27B models despite its larger on-disk footprint. If those measurements generalize, the implication is that Apple Silicon may be more compelling for MoE-heavy local inference than dense-model comparisons alone would suggest.
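The active-experts argument can be made concrete with a back-of-the-envelope bandwidth model. This sketch assumes token generation is purely memory-bandwidth-bound and that roughly 0.5 bytes per active parameter are read per token at 4-bit quantization; both are simplifying assumptions (they ignore KV-cache traffic, activations, and overhead), not figures from the post:

```python
BANDWIDTH_GBPS = 614.0       # M5 Max memory bandwidth reported in the post (GB/s)
BYTES_PER_PARAM_4BIT = 0.5   # rough 4-bit weight cost per parameter (assumption)

def tg_ceiling(active_params_b: float) -> float:
    """Bandwidth-bound upper limit on tok/s if only active weights are read."""
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_PARAM_4BIT
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token

# MoE 35B-A3B reads ~3B active params per token; a dense 27B reads all 27B.
print(f"35B-A3B MoE ceiling: {tg_ceiling(3.0):.0f} tok/s (measured 92.2)")
print(f"27B dense   ceiling: {tg_ceiling(27.0):.0f} tok/s (measured 24.3)")
```

On this crude model the MoE's theoretical ceiling is roughly 9x the dense model's, which is directionally consistent with the post's explanation even though neither measured number approaches its bound.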

At the same time, the thread should not be treated as neutral ground truth. Performance depends heavily on quantization, runtime, test shape, prompt length, and model build. The author does provide a fair amount of methodology, including the use of full GPU offload, flash attention, and specific GGUF sources, but those details do not eliminate the usual variance that comes with local benchmarking. Readers should treat the numbers as a detailed community datapoint rather than an industry-standard benchmark table.

That still makes the post useful. It turns vague questions about Apple Silicon and local LLM performance into concrete numbers, especially around MoE prompt processing and the fairness of MLX versus llama.cpp comparisons. Anyone evaluating the claim set should read the Reddit thread directly and map the reported results onto their own workloads, context sizes, and inference stack.


Related Articles

LLM Reddit Mar 14, 2026 2 min read

A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.

© 2026 Insights. All rights reserved.