Ollama 0.17 Arrives with New Inference Engine: Up to 40% Faster Local AI
Ollama, the popular local AI model runner, released version 0.17 on February 22, 2026, introducing a significant architectural change: a new native Ollama Engine that replaces the previous reliance on llama.cpp's server mode. The result is up to 40% faster prompt processing and 18% faster token generation on NVIDIA GPUs—with no changes required from users.
The Architecture Change
The new engine integrates the llama.cpp library more directly into Ollama's own scheduling and memory management layer. This gives the Ollama team finer control over how models are loaded, how memory is allocated across GPUs, and how concurrent requests are handled. Users continue to interact with Ollama exactly as before—the change is entirely under the hood.
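Because the change is entirely internal, existing client code is unaffected. As a minimal sketch, this builds the JSON body a client would POST to Ollama's generate endpoint (by default `http://localhost:11434/api/generate`); the model name is an example, and actually sending the request assumes a local Ollama install with that model pulled:

```python
import json

# The request shape Ollama clients use today; the 0.17 engine swap does not
# change this API, only how the server schedules and runs the model.
payload = {
    "model": "llama3.2",              # example: any locally pulled model
    "prompt": "Why is the sky blue?",
    "stream": False,                  # one JSON response instead of a stream
}
body = json.dumps(payload)
print(body)
```

The same body works before and after upgrading, which is the point: the engine swap is invisible at the API boundary.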
Performance Gains at a Glance
- Up to 40% faster prompt processing on NVIDIA GPUs
- Up to 18% faster token generation on NVIDIA GPUs
- Around 10–15% faster prompt processing on Apple Silicon
Better Multi-GPU Support and Memory Management
The release improves tensor parallelism for distributing large models (70B+ parameters) across multiple NVIDIA GPUs. Enhanced KV cache quantization allows users to maintain longer conversations and process longer documents without exhausting GPU memory.
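To see why KV cache quantization stretches GPU memory, here is a back-of-envelope estimate. The figures are illustrative (roughly an 8B Llama-style model with grouped-query attention), not Ollama's exact accounting, and the quantized byte counts are approximations:

```python
# KV cache size per sequence: 2 tensors (K and V) * layers * KV heads
# * head dimension * bytes per element * context length.
def kv_cache_bytes(ctx_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2.0):
    return int(2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len)

f16 = kv_cache_bytes(8192, bytes_per_elem=2.0)  # full-precision cache
q8  = kv_cache_bytes(8192, bytes_per_elem=1.0)  # ~8-bit quantized
q4  = kv_cache_bytes(8192, bytes_per_elem=0.5)  # ~4-bit quantized
print(f16 // 2**20, q8 // 2**20, q4 // 2**20)   # prints "1024 512 256" (MiB)
```

At an 8K context this hypothetical model's cache shrinks from about 1 GiB at f16 to roughly a quarter of that at 4-bit, which is what lets longer conversations and documents fit in the same VRAM.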
Expanded Hardware Support
Version 0.17 adds support for AMD Radeon RX 9070 series (RDNA 4) and improved Intel Arc GPU compatibility via updated oneAPI and SYCL integration—significantly broadening Ollama's reach beyond NVIDIA and Apple Silicon.
Source: Ollama Releases — GitHub
Related Articles
llmfit is an open-source CLI tool that automatically detects your system's RAM, CPU, and GPU specs to recommend the optimal LLM model and quantization level, dramatically lowering the barrier to running local AI.
A high-scoring Hacker News thread highlighted announcement #19759 in ggml-org/llama.cpp: the ggml.ai founding team is joining Hugging Face, while maintainers state ggml/llama.cpp will remain open-source and community-driven.