Ollama’s MLX Preview Pushes Local LLM Performance on Apple Silicon
Original: Ollama is now powered by MLX on Apple Silicon in preview
On March 30, 2026, Ollama said its Apple Silicon preview is now built on MLX, Apple’s machine learning framework. The linked Hacker News discussion reached 226 points and 101 comments on March 31, a sign of how much attention local LLM performance on macOS is getting from developers.
What changed
According to Ollama’s announcement, the new path uses MLX and Apple’s unified memory architecture to speed up both prefill and decode. On M5, M5 Pro, and M5 Max systems, Ollama also says it can use the new GPU Neural Accelerators to improve both time to first token and steady-state generation speed.
- Prefill moved from 1154 tokens/s in Ollama 0.18 to 1810 tokens/s in Ollama 0.19.
- Decode moved from 58 tokens/s to 112 tokens/s.
- With int4, Ollama says the same setup can reach 1851 tokens/s prefill and 134 tokens/s decode.
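Taken at face value, the quoted figures work out to roughly a 1.6x prefill and 1.9x decode speedup. A quick check of the arithmetic:

```python
# Throughput figures quoted in the announcement (tokens/s).
prefill_old, prefill_new = 1154, 1810
decode_old, decode_new = 58, 112

prefill_speedup = prefill_new / prefill_old
decode_speedup = decode_new / decode_old

print(f"prefill: {prefill_speedup:.2f}x")  # 1.57x
print(f"decode:  {decode_speedup:.2f}x")   # 1.93x
```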
The benchmark setup matters. Ollama says the test was run on March 29, 2026 with Alibaba’s Qwen3.5-35B-A3B quantized to NVFP4, while the older implementation used Q4_K_M. So the announcement is not just a backend swap. It is also a new quantization path and a local inference workflow tuned for coding-oriented models.
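Anyone wanting to sanity-check numbers like these on their own hardware can read throughput straight out of Ollama's local HTTP API, whose non-streaming responses report token counts and durations in nanoseconds. A minimal sketch, assuming a local server on the default port; the model tag below is a placeholder, not necessarily the benchmarked build:

```python
import json
import urllib.request

def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert Ollama's token count + nanosecond duration into tokens/s."""
    return count / (duration_ns / 1e9) if duration_ns else 0.0

def benchmark(model: str, prompt: str, host: str = "http://localhost:11434"):
    """Run one non-streaming generation and report prefill/decode rates."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    return {
        "prefill_tps": tokens_per_second(stats["prompt_eval_count"],
                                         stats["prompt_eval_duration"]),
        "decode_tps": tokens_per_second(stats["eval_count"],
                                        stats["eval_duration"]),
    }

# Example (requires a running Ollama server; model tag is illustrative):
# print(benchmark("qwen3.5:35b-a3b", "Write a quicksort in Python."))
```

Single-request numbers are noisy; averaging several runs with the same prompt length gets closer to the steady-state figures vendors quote.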
Why it matters
Ollama is also adding NVFP4 support, which it frames as a way to keep quality closer to production inference while reducing bandwidth and storage pressure. The release notes pair that with cache reuse across conversations, intelligent prompt checkpoints, and smarter eviction, all aimed at agentic and coding workloads rather than single-turn chat demos.
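The release notes do not spell out the caching mechanics, but the idea behind cache reuse is simple: if a new request shares a token prefix with an earlier one, the attention (KV) state for that prefix can be reused instead of recomputed. A toy illustration of the bookkeeping with LRU eviction — a conceptual sketch, not Ollama's implementation:

```python
from collections import OrderedDict

class PrefixCache:
    """Toy prefix cache: maps token-prefix keys to (placeholder) KV state,
    evicting the least recently used entry when over capacity."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.entries: OrderedDict[tuple, object] = OrderedDict()

    def longest_prefix(self, tokens: list[int]):
        """Return the longest cached prefix of `tokens`, or None on a miss."""
        for n in range(len(tokens), 0, -1):
            key = tuple(tokens[:n])
            if key in self.entries:
                self.entries.move_to_end(key)  # mark as recently used
                return key
        return None

    def store(self, tokens: list[int], kv_state: object):
        """Cache KV state for a full prompt, evicting the LRU entry if full."""
        key = tuple(tokens)
        self.entries[key] = kv_state
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = PrefixCache()
cache.store([1, 2, 3], "kv-for-system-prompt")
hit = cache.longest_prefix([1, 2, 3, 4, 5])  # reuses the [1, 2, 3] prefix
print(hit)  # (1, 2, 3)
```

For agentic workloads, where every turn resends the same system prompt and tool definitions, a prefix hit skips most of the prefill work, which is why the release notes pair caching with checkpointing and eviction policy rather than raw decode speed alone.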
For developers using tools such as Claude Code, OpenCode, or Codex on Macs with more than 32 GB of unified memory, the preview points to a more practical local stack. The original source is the Ollama blog post; community reaction is visible in the Hacker News thread.
Related Articles
A high-scoring LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.
A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.
Running Nvidia PersonaPlex 7B in Swift on Apple Silicon moves local voice agents closer to real time
An HN post on a Swift/MLX port of Nvidia PersonaPlex 7B shows how chunking, buffering, and interrupt handling matter as much as raw model quality for local speech-to-speech agents.