r/LocalLLaMA: The real latency trade-offs between MLX and llama.cpp on M1 Max

Original post: "MLX is NOT FASTER? I benchmarked MLX vs llama.cpp on the M1 Max, and the results surprised me..."

LLM · Mar 14, 2026 · By Insights AI (Reddit) · 2 min read

A recent r/LocalLLaMA thread pushed back on one of the most common local LLM claims on Apple Silicon: that switching from GGUF or Ollama-style stacks to MLX automatically makes inference “faster.” The post's point is not that MLX is slow. It is that many community comparisons rely on generation tokens per second alone, even when the user-facing latency is dominated by prompt ingestion and context handling. For people running coding agents, document classifiers, or multi-turn RAG workflows, that distinction matters more than a benchmark screenshot.

The benchmark setup was concrete enough to be useful. The author tested Qwen3.5-35B-A3B on an M1 Max 64GB using LM Studio 0.4.5, comparing an MLX 4-bit path against a GGUF Q4_K_M path backed by llama.cpp. On pure generation speed, MLX did exactly what people expect: around 57 tok/s versus roughly 29 tok/s for GGUF. But once the prompt got large, the picture changed. With a short prompt of about 655 tokens, the author's effective throughput from request send to final token was about 13 tok/s for MLX versus about 20 tok/s for llama.cpp. At a much longer prompt of roughly 8,496 tokens, both paths landed around 3 tok/s end to end.
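The gap between a 57 tok/s decode rate and a 13 tok/s user experience falls out of simple arithmetic: total wait is prompt ingestion plus generation, and only generated tokens count toward what the user sees. A minimal sketch of that relationship follows; the prefill rates used here are illustrative assumptions, not the author's measured values.

```python
def effective_tps(prompt_toks: int, output_toks: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """End-to-end tokens/sec from request send to final token.

    Total wait = prefill time (prompt ingestion) + decode time (generation);
    only the generated tokens count toward user-visible throughput.
    """
    total_s = prompt_toks / prefill_tps + output_toks / decode_tps
    return output_toks / total_s

# Illustrative only: the 120 tok/s prefill rate is an assumption,
# chosen to show the shape of the effect, not a measured figure.
short = effective_tps(prompt_toks=655,  output_toks=500,
                      prefill_tps=120, decode_tps=57)
long_ = effective_tps(prompt_toks=8496, output_toks=500,
                      prefill_tps=120, decode_tps=57)
print(f"short prompt: {short:.1f} tok/s end to end")
print(f"long prompt:  {long_:.1f} tok/s end to end")
```

Even with a generous prefill rate, the same 57 tok/s decoder delivers a small fraction of that once the prompt dwarfs the output, which is exactly the pattern the benchmark showed.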

The key reason was prefill. In the long-context case, the author measured prefill at roughly 94% of total response time for the MLX run, meaning that even a much faster generation rate barely moves the overall wait. That led to a conclusion more nuanced than "MLX bad" or "llama.cpp good." MLX still looked better when the output was long and the context was short. But once the workflow became retrieval-heavy or agentic and prompt ingestion expanded, the advertised speedup narrowed sharply. The thread also suggested that model-specific support can matter: commenters argued that Qwen3.5's hybrid attention path may currently be better supported in llama.cpp.
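That 94% figure has an Amdahl's-law flavor: the fraction of time spent in prefill puts a hard ceiling on what any decode-side speedup can buy. A small sketch of that bound:

```python
def decode_speedup_ceiling(prefill_fraction: float) -> float:
    """Amdahl-style bound: if prefill takes this fraction of total
    response time, even an infinitely fast decoder can only shrink
    the total wait to the prefill portion, i.e. a best-case overall
    speedup of 1 / prefill_fraction."""
    return 1.0 / prefill_fraction

# At the thread's long-context measurement (prefill ~94% of the wait),
# no amount of extra decode speed can improve the total wait by more
# than about 6%.
ceiling = decode_speedup_ceiling(0.94)
print(f"max overall speedup from faster decode: {ceiling:.3f}x")
```

This is why the two backends converge to roughly the same ~3 tok/s end to end at 8K-token prompts: the term where MLX wins has become almost irrelevant to the total.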

The practical takeaway is that local LLM users should benchmark the workload they actually care about, not a single generation number. If your stack spends most of its time reading large contexts, prompt-processing efficiency can outweigh raw decode speed. If your stack streams long answers from short prompts, MLX may still be the clear winner. The author published the reproducible harness here: github.com/famstack-dev/local-llm-bench.
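One practical way to benchmark the workload you actually care about is to time prompt ingestion and generation separately from a streaming response: time to first token is dominated by prefill, while the inter-token rate reflects decode speed. The sketch below demonstrates the split against a simulated stream; `fake_stream` and its timings are purely hypothetical stand-ins for the iterator a real local server's streaming API would return.

```python
import time
from typing import Iterator

def measure_stream(stream: Iterator[str]) -> dict:
    """Split wall-clock time into time-to-first-token (~= prefill)
    and per-token decode throughput."""
    t0 = time.perf_counter()
    t_first = None
    n = 0
    for _ in stream:
        n += 1
        if t_first is None:
            t_first = time.perf_counter()
    t_end = time.perf_counter()
    return {
        "ttft_s": t_first - t0,                 # dominated by prefill
        "decode_tps": (n - 1) / (t_end - t_first) if n > 1 else float("nan"),
        "e2e_tps": n / (t_end - t0),            # what the user actually feels
    }

def fake_stream(prefill_s: float, n_tokens: int, per_token_s: float):
    """Hypothetical stand-in for a real streaming response."""
    time.sleep(prefill_s)        # simulated prompt ingestion
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(per_token_s)  # simulated per-token decode
        yield "tok"

stats = measure_stream(fake_stream(prefill_s=0.2, n_tokens=11, per_token_s=0.01))
print(stats)
```

Run against your own prompt sizes, this separates the two numbers the thread argues get conflated: a backend can post a high `decode_tps` while `e2e_tps` tells a very different story.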



