r/LocalLLaMA: The real latency trade-offs between MLX and llama.cpp on M1 Max
Original: MLX is NOT FASTER? I benchmarked MLX vs llama.cpp on the M1 Max, and the results surprised me...
A recent r/LocalLLaMA thread pushed back on one of the most common local LLM claims on Apple Silicon: that switching from GGUF or Ollama-style stacks to MLX automatically makes inference “faster.” The post's point is not that MLX is slow. It is that many community comparisons rely on generation tokens per second alone, even when the user-facing latency is dominated by prompt ingestion and context handling. For people running coding agents, document classifiers, or multi-turn RAG workflows, that distinction matters more than a benchmark screenshot.
The benchmark setup was concrete enough to be useful. The author tested Qwen3.5-35B-A3B on an M1 Max 64GB using LM Studio 0.4.5, comparing an MLX 4-bit path against a GGUF Q4_K_M path backed by llama.cpp. On pure generation speed, MLX did exactly what people expect: around 57 tok/s versus roughly 29 tok/s for GGUF. But once the prompt got large, the picture changed. With a short prompt of about 655 tokens, the author's effective throughput from request send to final token was about 13 tok/s for MLX versus about 20 tok/s for llama.cpp. At a much longer prompt of roughly 8,496 tokens, both paths landed around 3 tok/s end to end.
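The arithmetic behind those end-to-end figures is worth making explicit: when prefill and decode run sequentially, effective throughput is output tokens divided by the sum of both phases. A minimal sketch, where the decode speeds are the post's reported figures but the prefill rates and output length are hypothetical (the post reports only the resulting end-to-end numbers):

```python
def effective_tps(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """End-to-end tokens/sec from request send to final token,
    assuming prefill and decode run sequentially."""
    total_s = prompt_tokens / prefill_tps + output_tokens / decode_tps
    return output_tokens / total_s

# Decode speeds from the post (~57 tok/s MLX, ~29 tok/s GGUF);
# the prefill rates and 256-token output below are HYPOTHETICAL
# illustrations, not measurements from the thread.
long_prompt = 8496
print(effective_tps(long_prompt, 256, prefill_tps=120.0, decode_tps=57.0))
print(effective_tps(long_prompt, 256, prefill_tps=240.0, decode_tps=29.0))
```

Plugging in larger prompts makes the prefill term dominate the denominator, which is exactly why a 2x decode advantage can vanish from the user-facing number.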
The key reason was prefill. In the long-context case, the author measured prefill at roughly 94% of total response time for the MLX run, which means a much higher generation rate no longer dominates the actual user wait. That led to a more nuanced conclusion than “MLX bad” or “llama.cpp good.” MLX still looked better when the output was long and the context was short. But once the workflow became retrieval-heavy or agentic and prompt ingestion expanded, the advertised speedup narrowed sharply. The thread also suggested that model-specific support can matter. In this case, commenters argued that Qwen3.5's hybrid attention path may currently be better supported in llama.cpp.
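The 94% figure can be read as an Amdahl's-law-style bound: a faster generation rate only shrinks the decode portion of the wait, so the ceiling on the speedup is set by the prefill fraction. A small sketch of that reasoning, with a hypothetical 60-second baseline wait:

```python
def total_wait_after_decode_speedup(total_s, prefill_frac, decode_speedup):
    """Amdahl-style bound: only the decode portion of the wait
    benefits from a faster generation rate."""
    prefill_s = total_s * prefill_frac
    decode_s = total_s * (1.0 - prefill_frac)
    return prefill_s + decode_s / decode_speedup

# With prefill at 94% of response time (the post's long-context
# measurement), even an infinitely fast decoder trims at most 6%
# off the total wait. The 60 s baseline here is hypothetical.
baseline = 60.0
print(total_wait_after_decode_speedup(baseline, 0.94, 2.0))
```

Doubling decode speed in this regime recovers only about 3% of the total wait, which matches the thread's conclusion that the advertised speedup narrows sharply once context grows.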
The practical takeaway is that local LLM users should benchmark the workload they actually care about, not a single generation number. If your stack spends most of its time reading large contexts, prompt-processing efficiency can outweigh raw decode speed. If your stack streams long answers from short prompts, MLX may still be the clear winner. The author published the reproducible harness here: github.com/famstack-dev/local-llm-bench.
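For readers wanting to apply the takeaway, the two numbers worth recording per request are time-to-first-token (which is dominated by prefill) and effective tokens/sec from send to final token. A minimal helper for summarizing one streamed response, assuming your client records a timestamp at send and at each token arrival (the harness details are up to your stack, not specified by the post):

```python
def stream_metrics(send_time, token_times):
    """Summarize one streamed response: time-to-first-token and
    effective tokens/sec from request send to final token."""
    return {
        "ttft_s": token_times[0] - send_time,
        "effective_tps": len(token_times) / (token_times[-1] - send_time),
    }

# In a real harness, record time.monotonic() once when the request
# is sent and again as each streamed token arrives, using the exact
# prompt sizes your agent or RAG pipeline actually produces.
```

Running this over your own prompts, rather than a short demo prompt, is what surfaces the prefill-dominated regime the thread describes.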
Related Articles
A fast-rising r/LocalLLaMA thread says the community has already submitted nearly 10,000 Apple Silicon benchmark runs across more than 400 models. The post matters because it replaces scattered anecdotes with a shared dataset that begins to show consistent throughput patterns across M-series chips and context lengths.
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
A well-received r/LocalLLaMA experiment described tinyforge: Qwen 3.5 0.8B running on a MacBook Air, trained on 13 self-generated repair pairs from a test-feedback loop, with a reported holdout jump from 16/50 to 28/50.