r/LocalLLaMA benchmark argues M5 Max shines most on MoE prompt processing

Original: [Round 2 - Followup] M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. (thank you for the feedback)

LLM · Mar 23, 2026 · By Insights AI (Reddit) · 2 min read

A Reddit post in r/LocalLLaMA shared a second-pass benchmark run for Apple's M5 Max 128GB system, drawing 104 points and 46 comments; the thread was posted on 2026-03-22 at 13:04 UTC. The author describes it as a rerun after community feedback on an earlier post, with an updated methodology and llama-bench used for more standardized measurements. That framing matters: these are community benchmarks, not vendor or lab results.

The hardware setup is described in unusual detail. The post lists an Apple M5 Max with an 18-core CPU, a 40-core Metal GPU, 128GB of unified memory, and 614 GB/s of memory bandwidth, running macOS 26.3.1 with llama.cpp v8420 and MLX v0.31.1. The author's main claim is that the chip's most visible improvement shows up in prompt processing rather than in token generation alone. On that measure, the post reports Qwen 3.5 35B-A3B MoE at 2,845 tok/s PP512 and 2,063 tok/s PP8192, while Qwen 3.5 122B-A10B MoE is listed at 1,011 tok/s PP512 and 749 tok/s PP8192.
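The PP512/PP8192 and token-generation figures correspond to standard llama-bench test columns (pp512, pp8192, tg128). A minimal invocation sketch of how such a run is typically set up; the model path and repetition count are illustrative assumptions, not values taken from the post:

```shell
# Hypothetical reproduction sketch; the GGUF path and -r value are assumptions.
# -p 512,8192 -> prompt-processing tests (pp512, pp8192)
# -n 128      -> token-generation test (tg128)
# -ngl 99     -> full GPU (Metal) offload, as the author reports
# -fa 1       -> flash attention enabled
llama-bench -m models/qwen3.5-35b-a3b-q4_k_m.gguf \
    -p 512,8192 -n 128 -ngl 99 -fa 1 -r 5
```

llama-bench reports each test's mean tok/s with a standard deviation over the repetitions, which is what makes runs like this easier to compare than one-off screenshots.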

The token generation numbers are also notable. The same post reports 92.2 tok/s for Qwen 3.5 35B-A3B MoE, 41.5 tok/s for Qwen 3.5 122B-A10B MoE, 24.3 tok/s for Qwen 3.5 27B Q4_K_M in llama.cpp, and 31.6 tok/s for MLX 4-bit Qwen 3.5 27B. One of the more useful parts of the thread is a correction: the author says an earlier v1 claim that MLX was 92% faster than llama.cpp was unfair because it compared different quantization levels. In the revised write-up, that edge is narrowed to roughly 30% at equivalent 4-bit quantization.
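The correction is easy to sanity-check from the reported numbers themselves. A small sketch using only the figures quoted above (the superseded "92%" claim came from mismatched quantization levels, so only the matched 4-bit pair is compared here):

```python
# Reported token-generation rates for Qwen 3.5 27B on the M5 Max (tok/s),
# as quoted in the post's revised write-up.
llama_cpp_q4km = 24.3   # llama.cpp, Q4_K_M quantization
mlx_4bit = 31.6         # MLX, 4-bit quantization

# Like-for-like speedup of MLX over llama.cpp at ~4-bit on both sides.
speedup = mlx_4bit / llama_cpp_q4km - 1.0
print(f"MLX edge at matched 4-bit quantization: {speedup:.0%}")  # ~30%
```

That works out to roughly 30%, matching the author's revised figure and illustrating why comparing a 4-bit MLX run against a different llama.cpp quantization inflated the earlier claim.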

The post's broader thesis is that Mixture-of-Experts models benefit disproportionately from Apple's unified memory design because only the weights of the active experts need to be read per token. That is how the author explains why the 35B-A3B MoE result looks so strong relative to dense 27B models despite its larger on-disk footprint. If those measurements generalize, the implication is that Apple Silicon may be more compelling for MoE-heavy local inference than dense-model comparisons alone would suggest.
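The active-experts argument can be made concrete with a back-of-the-envelope bandwidth model. This sketch assumes token generation is purely memory-bandwidth-bound and that roughly 0.5 bytes per active parameter are read per token at 4-bit quantization; both are simplifying assumptions (they ignore KV-cache traffic, activations, and overhead), not figures from the post:

```python
BANDWIDTH_GBPS = 614.0       # M5 Max memory bandwidth reported in the post (GB/s)
BYTES_PER_PARAM_4BIT = 0.5   # rough 4-bit weight cost per parameter (assumption)

def tg_ceiling(active_params_b: float) -> float:
    """Bandwidth-bound upper limit on tok/s if only active weights are read."""
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_PARAM_4BIT
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token

# MoE 35B-A3B reads ~3B active params per token; a dense 27B reads all 27B.
print(f"35B-A3B MoE ceiling: {tg_ceiling(3.0):.0f} tok/s (measured 92.2)")
print(f"27B dense   ceiling: {tg_ceiling(27.0):.0f} tok/s (measured 24.3)")
```

On this crude model the MoE's theoretical ceiling is roughly 9x the dense model's, which is directionally consistent with the post's explanation even though neither measured number approaches its bound.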

At the same time, the thread should not be treated as neutral ground truth. Performance depends heavily on quantization, runtime, test shape, prompt length, and model build. The author does provide a fair amount of methodology, including the use of full GPU offload, flash attention, and specific GGUF sources, but those details do not eliminate the usual variance that comes with local benchmarking. Readers should treat the numbers as a detailed community datapoint rather than an industry-standard benchmark table.

That still makes the post useful. It turns vague questions about Apple Silicon and local LLM performance into concrete numbers, especially around MoE prompt processing and the fairness of MLX versus llama.cpp comparisons. Anyone evaluating the claim set should read the Reddit thread directly and map the reported results onto their own workloads, context sizes, and inference stack.


Related Articles

LLM Reddit Mar 14, 2026 2 min read

A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.

© 2026 Insights. All rights reserved.