Ollama’s MLX Preview Pushes Local LLM Performance on Apple Silicon

Original: Ollama is now powered by MLX on Apple Silicon in preview

LLM · Mar 31, 2026 · By Insights AI (HN) · 1 min read

On March 30, 2026, Ollama said its Apple Silicon preview is now built on MLX, Apple’s machine learning framework. The linked Hacker News discussion reached 226 points and 101 comments on March 31, a sign of how much attention local LLM performance on macOS is getting from developers.

What changed

According to Ollama’s announcement, the new path uses MLX and Apple’s unified memory architecture to speed up both prefill and decode. On M5, M5 Pro, and M5 Max systems, Ollama also says it can use the new GPU Neural Accelerators to improve both time to first token and steady-state generation speed.

  • Prefill moved from 1154 tokens/s in Ollama 0.18 to 1810 tokens/s in Ollama 0.19.
  • Decode moved from 58 tokens/s to 112 tokens/s.
  • With int4, Ollama says the same setup can reach 1851 tokens/s prefill and 134 tokens/s decode.
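Taken together, the headline 0.18 → 0.19 figures work out to roughly a 1.6x prefill and 1.9x decode improvement. A quick check, using only the numbers quoted above:

```python
# Throughput figures quoted in Ollama's 0.18 -> 0.19 comparison.
prefill_old, prefill_new = 1154, 1810   # tokens/s
decode_old, decode_new = 58, 112        # tokens/s

prefill_speedup = prefill_new / prefill_old
decode_speedup = decode_new / decode_old

print(f"prefill: {prefill_speedup:.2f}x, decode: {decode_speedup:.2f}x")
# prefill: 1.57x, decode: 1.93x
```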

The benchmark setup matters. Ollama says the test was run on March 29, 2026 with Alibaba's Qwen3.5-35B-A3B, quantized to NVFP4 on the new path, while the older implementation used Q4_K_M. Because the two runs use different quantization formats, the comparison reflects more than a backend swap: it bundles the MLX backend with a new quantization path and a local inference workflow tuned for coding-oriented models.
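NVFP4 and Q4_K_M are both 4-bit formats, but they differ in block size and how scales are encoded. As a rough illustration of what any blockwise 4-bit scheme does (a generic toy sketch, not the actual NVFP4 or Q4_K_M layout), values are grouped into small blocks, and each block stores one shared scale plus 4-bit codes:

```python
import numpy as np

def quantize_block_int4(x, block=16):
    """Toy symmetric 4-bit blockwise quantization: one float scale per
    block, codes in [-8, 7]. Real formats (NVFP4, Q4_K_M) differ in block
    size, scale encoding, and zero-point handling."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block                       # pad to a whole number of blocks
    xp = np.pad(x, (0, pad)).reshape(-1, block)
    scales = np.abs(xp).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                     # avoid divide-by-zero on empty blocks
    codes = np.clip(np.round(xp / scales), -8, 7).astype(np.int8)
    return codes, scales, len(x)

def dequantize_block_int4(codes, scales, n):
    """Reconstruct approximate float values from codes and per-block scales."""
    return (codes * scales).reshape(-1)[:n]

weights = np.random.default_rng(0).normal(size=100).astype(np.float32)
codes, scales, n = quantize_block_int4(weights)
recon = dequantize_block_int4(codes, scales, n)
print("max abs error:", float(np.abs(weights - recon).max()))
```

The payoff is the same trade production formats make: 4 bits per weight plus a small per-block overhead, at the cost of bounded rounding error within each block.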

Why it matters

Ollama is also adding NVFP4 support, which it frames as a way to keep quality closer to production inference while reducing bandwidth and storage pressure. The release notes pair that with cache reuse across conversations, intelligent prompt checkpoints, and smarter eviction, all aimed at agentic and coding workloads rather than single-turn chat demos.
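Ollama has not published the internals of its cache-reuse scheme, but the general idea behind prompt-prefix caching can be sketched in a few lines (an illustrative toy, not Ollama's implementation): key the KV cache by a hash of the token prefix, and on a new request reuse the longest cached prefix so only the suffix needs a fresh prefill.

```python
import hashlib

class PrefixCache:
    """Toy prompt-prefix cache: maps a hash of a token prefix to an opaque
    KV-cache handle. Illustrative only; a real inference server also tracks
    memory budgets, eviction order, and per-sequence state."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def put(self, tokens, kv_handle):
        self._store[self._key(tokens)] = (len(tokens), kv_handle)

    def longest_prefix(self, tokens):
        """Return (matched_len, kv_handle) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            hit = self._store.get(self._key(tokens[:end]))
            if hit:
                return hit
        return 0, None

cache = PrefixCache()
cache.put([1, 2, 3], kv_handle="kv-abc")          # cached from a prior turn
matched, kv = cache.longest_prefix([1, 2, 3, 4, 5])
print(matched, kv)  # only tokens [4, 5] need a fresh prefill
```

This is why multi-turn agentic workloads benefit disproportionately: each turn shares a long system-prompt-plus-history prefix with the turn before it.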

For developers using tools such as Claude Code, OpenCode, or Codex on Macs with more than 32 GB of unified memory, the preview points to a more practical local stack. The original source is the Ollama blog post; community reaction is visible in the Hacker News thread.
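For readers who have not wired a local model into a tool before, Ollama serves an HTTP API on localhost:11434; a minimal non-streaming call to its /api/generate endpoint looks like the following (the model tag is a placeholder, substitute whatever you have pulled locally):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port

def build_request(prompt, model="qwen3.5:35b-a3b"):
    """Build the JSON body for a non-streaming /api/generate call.
    The model tag is a placeholder; use one you have actually pulled."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="qwen3.5:35b-a3b"):
    """POST to a running local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Needs a live server: print(generate("Summarize MLX in one sentence."))
```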


Related Articles


A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.

