Ollama previews MLX-powered Apple Silicon runtime

Original: Ollama is now powered by MLX on Apple Silicon in preview

LLM · Apr 1, 2026 · By Insights AI (HN) · 2 min read

On March 31, 2026, a Hacker News thread about Ollama’s new MLX runtime climbed to 605 points and 328 comments. The linked source is Ollama’s March 30 announcement that its Apple Silicon build now runs on top of Apple’s MLX framework, a move aimed at getting more performance out of unified memory on modern Macs.

According to the official announcement, the preview is focused on local coding and agentic workloads rather than a general marketing refresh. Ollama says the new stack improves both time to first token and decode speed, and that M5, M5 Pro, and M5 Max systems can take advantage of the new GPU Neural Accelerators. The launch example uses Alibaba’s Qwen3.5-35B-A3B model quantized to NVFP4, and Ollama says version 0.19 should push performance further with int4 quantization.

Key launch claims

  • Apple Silicon inference now runs on MLX instead of the previous local stack.
  • NVFP4 support is meant to preserve quality while reducing memory bandwidth and storage pressure.
  • Cache reuse across conversations, smarter prompt checkpoints, and improved eviction target long-lived agent sessions.
  • The preview release is currently tuned around a Qwen3.5 coding model and Ollama recommends Macs with more than 32GB of unified memory.
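The NVFP4 claim is easiest to see with a toy example. The sketch below is not Ollama's or MLX's actual NVFP4 kernel; it only illustrates the general idea behind FP4 block formats: 4-bit (E2M1) values share a per-block scale, so a 16-element block costs roughly 4 bits per weight plus one scale, instead of 16 bits per weight for FP16.

```python
# Toy block quantizer in the spirit of FP4 block formats (illustration only,
# not Ollama's or MLX's actual NVFP4 implementation).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 values
FP4_GRID = sorted({s * m for s in (1.0, -1.0) for m in FP4_MAGNITUDES})

def quantize_block(block):
    """Scale a block so its largest value maps onto the FP4 range [-6, 6],
    then snap every element to the nearest representable FP4 value."""
    scale = max(abs(x) for x in block) / 6.0 or 1.0  # avoid div-by-zero
    codes = [min(FP4_GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate values from the shared scale and 4-bit codes."""
    return [scale * g for g in codes]

weights = [0.1 * i - 0.8 for i in range(16)]  # one 16-element block
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
```

The trade-off this illustrates is the one in the bullet above: the codes are tiny, and quality depends on how well a single scale per small block captures the spread of values inside it.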

The interesting change is not just raw throughput. Ollama is also trying to remove the workflow friction that shows up when tools like Claude Code, OpenCode, or Codex repeatedly send large system prompts and tool traces. Reusing cache across branches and saving intermediate checkpoints can matter as much as raw decode throughput if the goal is to make local agents feel responsive enough for daily development work.
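A rough sketch of why this matters (the function and token lists here are hypothetical; the announcement does not publish Ollama's cache internals): if consecutive agent requests share a long common prefix, only the divergent suffix needs to be recomputed, and the KV cache for the prefix can be reused.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Length of the shared prefix between a cached request and a new one;
    KV-cache entries covering this prefix need no recomputation."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A coding agent resends the same system prompt and tool schema each turn;
# only the user's instruction at the end changes.
cached = ["<system>", "<tool-schema>", "<repo-map>", "fix", "the", "bug"]
new    = ["<system>", "<tool-schema>", "<repo-map>", "add", "a", "test"]
shared = reusable_prefix_len(cached, new)
```

With prompts where the shared prefix is tens of thousands of tokens, skipping its re-processing dominates the perceived latency far more than per-token decode speed does.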

There are still practical caveats. The published figures are vendor-reported launch metrics, so developers will want independent tests on older M-series machines and on real IDE or terminal workflows before treating the gains as settled. Even so, the Hacker News response shows strong demand for a local stack that narrows the gap between consumer Macs and cloud inference without forcing users into a private-serving setup.

Community source: Hacker News discussion. Primary source: Ollama blog.


Related Articles

LLM · Reddit · 3d ago · 2 min read

A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.


© 2026 Insights. All rights reserved.