r/LocalLLaMA: The real latency trade-offs between MLX and llama.cpp on M1 Max
Original: MLX is NOT FASTER? I benchmarked MLX vs llama.cpp on the M1 Max, and the results surprised me...
A recent r/LocalLLaMA thread pushed back on one of the most common local LLM claims on Apple Silicon: that switching from GGUF or Ollama-style stacks to MLX automatically makes inference “faster.” The post's point is not that MLX is slow. It is that many community comparisons rely on generation tokens per second alone, even when the user-facing latency is dominated by prompt ingestion and context handling. For people running coding agents, document classifiers, or multi-turn RAG workflows, that distinction matters more than a benchmark screenshot.
The benchmark setup was concrete enough to be useful. The author tested Qwen3.5-35B-A3B on an M1 Max 64GB using LM Studio 0.4.5, comparing an MLX 4-bit path against a GGUF Q4_K_M path backed by llama.cpp. On pure generation speed, MLX did exactly what people expect: around 57 tok/s versus roughly 29 tok/s for GGUF. But once the prompt got large, the picture changed. With a short prompt of about 655 tokens, the author's effective throughput from request send to final token was about 13 tok/s for MLX versus about 20 tok/s for llama.cpp. At a much longer prompt of roughly 8,496 tokens, both paths landed around 3 tok/s end to end.
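The arithmetic behind those end-to-end figures is worth making explicit: when prefill and decode run sequentially, effective throughput is output tokens divided by the sum of both phases. A minimal sketch, where the decode speeds are the post's reported figures but the prefill rates and output length are hypothetical (the post reports only the resulting end-to-end numbers):

```python
def effective_tps(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """End-to-end tokens/sec from request send to final token,
    assuming prefill and decode run sequentially."""
    total_s = prompt_tokens / prefill_tps + output_tokens / decode_tps
    return output_tokens / total_s

# Decode speeds from the post (~57 tok/s MLX, ~29 tok/s GGUF);
# the prefill rates and 256-token output below are HYPOTHETICAL
# illustrations, not measurements from the thread.
long_prompt = 8496
print(effective_tps(long_prompt, 256, prefill_tps=120.0, decode_tps=57.0))
print(effective_tps(long_prompt, 256, prefill_tps=240.0, decode_tps=29.0))
```

Plugging in larger prompts makes the prefill term dominate the denominator, which is exactly why a 2x decode advantage can vanish from the user-facing number.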
The key reason was prefill. In the long-context case, the author measured prefill at roughly 94% of total response time for the MLX run, which means a much higher generation rate no longer dominates the actual user wait. That led to a more nuanced conclusion than “MLX bad” or “llama.cpp good.” MLX still looked better when the output was long and the context was short. But once the workflow became retrieval-heavy or agentic and prompt ingestion expanded, the advertised speedup narrowed sharply. The thread also suggested that model-specific support can matter. In this case, commenters argued that Qwen3.5's hybrid attention path may currently be better supported in llama.cpp.
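The 94% figure can be read as an Amdahl's-law-style bound: a faster generation rate only shrinks the decode portion of the wait, so the ceiling on the speedup is set by the prefill fraction. A small sketch of that reasoning, with a hypothetical 60-second baseline wait:

```python
def total_wait_after_decode_speedup(total_s, prefill_frac, decode_speedup):
    """Amdahl-style bound: only the decode portion of the wait
    benefits from a faster generation rate."""
    prefill_s = total_s * prefill_frac
    decode_s = total_s * (1.0 - prefill_frac)
    return prefill_s + decode_s / decode_speedup

# With prefill at 94% of response time (the post's long-context
# measurement), even an infinitely fast decoder trims at most 6%
# off the total wait. The 60 s baseline here is hypothetical.
baseline = 60.0
print(total_wait_after_decode_speedup(baseline, 0.94, 2.0))
```

Doubling decode speed in this regime recovers only about 3% of the total wait, which matches the thread's conclusion that the advertised speedup narrows sharply once context grows.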
The practical takeaway is that local LLM users should benchmark the workload they actually care about, not a single generation number. If your stack spends most of its time reading large contexts, prompt-processing efficiency can outweigh raw decode speed. If your stack streams long answers from short prompts, MLX may still be the clear winner. The author published the reproducible harness here: github.com/famstack-dev/local-llm-bench.
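For readers wanting to apply the takeaway, the two numbers worth recording per request are time-to-first-token (which is dominated by prefill) and effective tokens/sec from send to final token. A minimal helper for summarizing one streamed response, assuming your client records a timestamp at send and at each token arrival (the harness details are up to your stack, not specified by the post):

```python
def stream_metrics(send_time, token_times):
    """Summarize one streamed response: time-to-first-token and
    effective tokens/sec from request send to final token."""
    return {
        "ttft_s": token_times[0] - send_time,
        "effective_tps": len(token_times) / (token_times[-1] - send_time),
    }

# In a real harness, record time.monotonic() once when the request
# is sent and again as each streamed token arrives, using the exact
# prompt sizes your agent or RAG pipeline actually produces.
```

Running this over your own prompts, rather than a short demo prompt, is what surfaces the prefill-dominated regime the thread describes.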
Related Articles
A fast-rising r/LocalLLaMA thread says the community has already submitted nearly 10,000 Apple Silicon benchmark runs across more than 400 models. The post matters because it replaces scattered anecdotes with a shared dataset that begins to show consistent throughput patterns across M-series chips and context lengths.
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
A well-received r/LocalLLaMA experiment described tinyforge: Qwen 3.5 0.8B running on a MacBook Air, trained on 13 self-generated repair pairs from a test-feedback loop, with a reported holdout jump from 16/50 to 28/50.