Ollama previews MLX-powered Apple Silicon runtime

Original: Ollama is now powered by MLX on Apple Silicon in preview

LLM · Apr 1, 2026 · By Insights AI (HN) · 2 min read

On March 31, 2026, a Hacker News thread about Ollama’s new MLX runtime climbed to 605 points and 328 comments. The linked source is Ollama’s March 30 announcement that its Apple Silicon build now runs on top of Apple’s MLX framework, a move aimed at getting more performance out of unified memory on modern Macs.

According to the official announcement, the preview is focused on local coding and agentic workloads rather than a general marketing refresh. Ollama says the new stack improves both time to first token and decode speed, and that M5, M5 Pro, and M5 Max systems can take advantage of the new GPU Neural Accelerators. The launch example uses Alibaba’s Qwen3.5-35B-A3B model quantized to NVFP4, and Ollama says version 0.19 should push performance further with int4 quantization.

Key launch claims

  • Apple Silicon inference now runs on MLX instead of the previous local stack.
  • NVFP4 support is meant to preserve quality while reducing memory bandwidth and storage pressure.
  • Cache reuse across conversations, smarter prompt checkpoints, and improved eviction target long-lived agent sessions.
  • The preview release is currently tuned around a Qwen3.5 coding model and Ollama recommends Macs with more than 32GB of unified memory.
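The NVFP4 claim is easiest to see with a toy example. The sketch below is not Ollama's or MLX's actual NVFP4 kernel; it only illustrates the general idea behind FP4 block formats: 4-bit (E2M1) values share a per-block scale, so a 16-element block costs roughly 4 bits per weight plus one scale, instead of 16 bits per weight for FP16.

```python
# Toy block quantizer in the spirit of FP4 block formats (illustration only,
# not Ollama's or MLX's actual NVFP4 implementation).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 values
FP4_GRID = sorted({s * m for s in (1.0, -1.0) for m in FP4_MAGNITUDES})

def quantize_block(block):
    """Scale a block so its largest value maps onto the FP4 range [-6, 6],
    then snap every element to the nearest representable FP4 value."""
    scale = max(abs(x) for x in block) / 6.0 or 1.0  # avoid div-by-zero
    codes = [min(FP4_GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate values from the shared scale and 4-bit codes."""
    return [scale * g for g in codes]

weights = [0.1 * i - 0.8 for i in range(16)]  # one 16-element block
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
```

The trade-off this illustrates is the one in the bullet above: the codes are tiny, and quality depends on how well a single scale per small block captures the spread of values inside it.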

The interesting change is not just raw throughput. Ollama is also trying to remove the workflow friction that shows up when tools like Claude Code, OpenCode, or Codex repeatedly send large system prompts and tool traces. Reusing cache across branches and saving intermediate checkpoints can matter as much as raw decode throughput if the goal is to make local agents feel responsive enough for daily development work.
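A rough sketch of why this matters (the function and token lists here are hypothetical; the announcement does not publish Ollama's cache internals): if consecutive agent requests share a long common prefix, only the divergent suffix needs to be recomputed, and the KV cache for the prefix can be reused.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Length of the shared prefix between a cached request and a new one;
    KV-cache entries covering this prefix need no recomputation."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A coding agent resends the same system prompt and tool schema each turn;
# only the user's instruction at the end changes.
cached = ["<system>", "<tool-schema>", "<repo-map>", "fix", "the", "bug"]
new    = ["<system>", "<tool-schema>", "<repo-map>", "add", "a", "test"]
shared = reusable_prefix_len(cached, new)
```

With prompts where the shared prefix is tens of thousands of tokens, skipping its re-processing dominates the perceived latency far more than per-token decode speed does.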

There are still practical caveats. The published figures are vendor-reported launch metrics, so developers will want independent tests on older M-series machines and on real IDE or terminal workflows before treating the gains as settled. Even so, the Hacker News response shows strong demand for a local stack that narrows the gap between consumer Macs and cloud inference without forcing users into a private-serving setup.

Community source: Hacker News discussion. Primary source: Ollama blog.


Related Articles

LLM · Reddit · 3d ago · 2 min read

A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.


© 2026 Insights. All rights reserved.