Ollama’s MLX Preview Pushes Local LLM Performance on Apple Silicon
Original: Ollama is now powered by MLX on Apple Silicon in preview
On March 30, 2026, Ollama said its Apple Silicon preview is now built on MLX, Apple’s machine learning framework. The linked Hacker News discussion reached 226 points and 101 comments on March 31, a sign of how much attention local LLM performance on macOS is getting from developers.
What changed
According to Ollama’s announcement, the new path uses MLX and Apple’s unified memory architecture to speed up both prefill and decode. On M5, M5 Pro, and M5 Max systems, Ollama also says it can use the new GPU Neural Accelerators to improve both time to first token and steady-state generation speed.
- Prefill moved from 1154 tokens/s in Ollama 0.18 to 1810 tokens/s in Ollama 0.19.
- Decode moved from 58 tokens/s to 112 tokens/s.
- With int4, Ollama says the same setup can reach 1851 tokens/s prefill and 134 tokens/s decode.
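Taken at face value, the quoted figures work out to roughly a 1.6x prefill and 1.9x decode speedup. A quick check of the arithmetic:

```python
# Throughput figures quoted in the announcement (tokens/s).
prefill_old, prefill_new = 1154, 1810
decode_old, decode_new = 58, 112

prefill_speedup = prefill_new / prefill_old
decode_speedup = decode_new / decode_old

print(f"prefill: {prefill_speedup:.2f}x")  # 1.57x
print(f"decode:  {decode_speedup:.2f}x")   # 1.93x
```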
The benchmark setup matters. Ollama says the test was run on March 29, 2026 with Alibaba’s Qwen3.5-35B-A3B quantized to NVFP4, while the older implementation used Q4_K_M. So the announcement is not just a backend swap. It is also a new quantization path and a local inference workflow tuned for coding-oriented models.
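Anyone wanting to sanity-check numbers like these on their own hardware can read throughput straight out of Ollama's local HTTP API, whose non-streaming responses report token counts and durations in nanoseconds. A minimal sketch, assuming a local server on the default port; the model tag below is a placeholder, not necessarily the benchmarked build:

```python
import json
import urllib.request

def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert Ollama's token count + nanosecond duration into tokens/s."""
    return count / (duration_ns / 1e9) if duration_ns else 0.0

def benchmark(model: str, prompt: str, host: str = "http://localhost:11434"):
    """Run one non-streaming generation and report prefill/decode rates."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    return {
        "prefill_tps": tokens_per_second(stats["prompt_eval_count"],
                                         stats["prompt_eval_duration"]),
        "decode_tps": tokens_per_second(stats["eval_count"],
                                        stats["eval_duration"]),
    }

# Example (requires a running Ollama server; model tag is illustrative):
# print(benchmark("qwen3.5:35b-a3b", "Write a quicksort in Python."))
```

Single-request numbers are noisy; averaging several runs with the same prompt length gets closer to the steady-state figures vendors quote.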
Why it matters
Ollama is also adding NVFP4 support, which it frames as a way to keep quality closer to production inference while reducing bandwidth and storage pressure. The release notes pair that with cache reuse across conversations, intelligent prompt checkpoints, and smarter eviction, all aimed at agentic and coding workloads rather than single-turn chat demos.
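The release notes do not spell out the caching mechanics, but the idea behind cache reuse is simple: if a new request shares a token prefix with an earlier one, the attention (KV) state for that prefix can be reused instead of recomputed. A toy illustration of the bookkeeping with LRU eviction — a conceptual sketch, not Ollama's implementation:

```python
from collections import OrderedDict

class PrefixCache:
    """Toy prefix cache: maps token-prefix keys to (placeholder) KV state,
    evicting the least recently used entry when over capacity."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.entries: OrderedDict[tuple, object] = OrderedDict()

    def longest_prefix(self, tokens: list[int]):
        """Return the longest cached prefix of `tokens`, or None on a miss."""
        for n in range(len(tokens), 0, -1):
            key = tuple(tokens[:n])
            if key in self.entries:
                self.entries.move_to_end(key)  # mark as recently used
                return key
        return None

    def store(self, tokens: list[int], kv_state: object):
        """Cache KV state for a full prompt, evicting the LRU entry if full."""
        key = tuple(tokens)
        self.entries[key] = kv_state
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = PrefixCache()
cache.store([1, 2, 3], "kv-for-system-prompt")
hit = cache.longest_prefix([1, 2, 3, 4, 5])  # reuses the [1, 2, 3] prefix
print(hit)  # (1, 2, 3)
```

For agentic workloads, where every turn resends the same system prompt and tool definitions, a prefix hit skips most of the prefill work, which is why the release notes pair caching with checkpointing and eviction policy rather than raw decode speed alone.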
For developers using tools such as Claude Code, OpenCode, or Codex on Macs with more than 32 GB of unified memory, the preview points to a more practical local stack. The original source is the Ollama blog post; community reaction is visible in the Hacker News thread.
Related Articles
A high-scoring LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.
A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.
Running Nvidia PersonaPlex 7B in Swift on Apple Silicon moves local voice agents closer to real time
An HN post on a Swift/MLX port of Nvidia PersonaPlex 7B shows how chunking, buffering, and interrupt handling matter as much as raw model quality for local speech-to-speech agents.