r/LocalLLaMA benchmark argues M5 Max shines most on MoE prompt processing
Original: [Round 2 - Followup] M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. (thank you for the feedback)
A Reddit post in r/LocalLLaMA shared a second-pass benchmark run for Apple's M5 Max 128GB system and drew 104 points and 46 comments. The thread was posted on 2026-03-22. The author describes it as a rerun after community feedback on an earlier post, with updated methodology and llama-bench used for more standardized measurements. That framing matters: these are community benchmarks, not vendor or lab benchmarks.
The hardware setup is described in unusual detail. The post lists an Apple M5 Max with an 18-core CPU, a 40-core Metal GPU, 128GB of unified memory, and 614 GB/s of memory bandwidth, running macOS 26.3.1 with llama.cpp v8420 and MLX v0.31.1. The author's main claim is that the chip's most visible improvement shows up in prompt processing rather than in token generation alone. On that measure, the post reports Qwen 3.5 35B-A3B MoE at 2,845 tok/s PP512 and 2,063 tok/s PP8192, while Qwen 3.5 122B-A10B MoE is listed at 1,011 tok/s PP512 and 749 tok/s PP8192.
The token generation numbers are also notable. The same post reports 92.2 tok/s for Qwen 3.5 35B-A3B MoE, 41.5 tok/s for Qwen 3.5 122B-A10B MoE, 24.3 tok/s for Qwen 3.5 27B Q4_K_M in llama.cpp, and 31.6 tok/s for MLX 4-bit Qwen 3.5 27B. One of the more useful parts of the thread is a correction: the author says an earlier v1 claim that MLX was 92% faster than llama.cpp was unfair because it compared different quantization levels. In the revised write-up, that edge is narrowed to roughly 30% at equivalent 4-bit quantization.
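The revised comparison can be sanity-checked directly from the reported throughput figures. A minimal sketch, using only the tok/s numbers quoted above:

```python
# Sanity check on the revised MLX vs llama.cpp comparison,
# using the token-generation numbers reported in the thread.
llama_cpp_tg = 24.3   # tok/s, Qwen 3.5 27B Q4_K_M in llama.cpp (reported)
mlx_tg = 31.6         # tok/s, Qwen 3.5 27B 4-bit in MLX (reported)

speedup = mlx_tg / llama_cpp_tg - 1.0
print(f"MLX edge at matched 4-bit quantization: {speedup:.0%}")  # ~30%
```

The roughly 30% result matches the author's corrected figure, which is why comparing at equal quantization matters: the withdrawn 92% claim came from pitting different bit widths against each other.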
The post's broader thesis is that Mixture-of-Experts models benefit disproportionately from Apple's unified memory design because only the active experts' weights need to be read for each generated token. That is how the author explains why the 35B-A3B MoE result looks so strong relative to dense 27B models, despite a larger on-disk footprint. If those measurements generalize, the implication is that Apple Silicon may be more compelling for MoE-heavy local inference than dense-model comparisons alone would suggest.
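A back-of-envelope model makes the active-expert argument concrete. This is my own simplified sketch, not the author's methodology: it assumes decode speed is bandwidth-bound, that roughly 4-bit quantization costs about 0.5 bytes per parameter, and it ignores KV-cache reads and runtime overhead.

```python
# Rough bandwidth-bound ceiling on token generation (simplified assumption:
# decode speed ~ memory bandwidth / bytes of weights read per token).
BANDWIDTH_GBPS = 614      # M5 Max unified memory bandwidth (reported)
BYTES_PER_PARAM = 0.5     # ~4-bit quantization (assumption)

def tg_ceiling(active_params_b: float) -> float:
    """Upper bound on tok/s from weight reads alone."""
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token

# Qwen 3.5 35B-A3B activates ~3B params per token; a dense 27B reads all 27B.
print(f"MoE, ~3B active:  ~{tg_ceiling(3):.0f} tok/s ceiling")
print(f"Dense, 27B:       ~{tg_ceiling(27):.0f} tok/s ceiling")
```

The MoE ceiling comes out roughly nine times higher than the dense one, which is consistent in direction (though not in magnitude, since real runs fall well below these ceilings) with the reported 92.2 tok/s versus 24.3 tok/s gap.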
At the same time, the thread should not be treated as neutral ground truth. Performance depends heavily on quantization, runtime, test shape, prompt length, and model build. The author does provide a fair amount of methodology, including the use of full GPU offload, flash attention, and specific GGUF sources, but those details do not eliminate the usual variance that comes with local benchmarking. Readers should treat the numbers as a detailed community datapoint rather than an industry-standard benchmark table.
That still makes the post useful. It turns vague questions about Apple Silicon and local LLM performance into concrete numbers, especially around MoE prompt processing and the fairness of MLX versus llama.cpp comparisons. Anyone evaluating the claim set should read the Reddit thread directly and map the reported results onto their own workloads, context sizes, and inference stack.