Multi-Token Prediction Support Lands in llama.cpp
Original: MTP support merged into llama.cpp
MTP Is Now in llama.cpp
PR #22673 has been merged into the llama.cpp master branch, bringing official Multi-Token Prediction (MTP) support to the most widely used local LLM inference engine. The news earned 300+ upvotes on r/LocalLLaMA as the community celebrated the milestone.
What Is MTP?
Standard autoregressive language models generate tokens one at a time in sequence. Multi-Token Prediction trains models to predict multiple future tokens in a single forward pass. At inference time, those extra prediction heads can serve as a built-in draft for speculative decoding, which is where the speedup comes from. DeepSeek-V3 and DeepSeek-R1 used MTP to achieve significant inference speed improvements, attracting considerable attention from the AI community.
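As a rough illustration of the training objective, here is a minimal sketch in which one shared hidden state feeds several output heads, each predicting the token a fixed number of steps ahead. The sizes, head layout, and two-token horizon are toy assumptions for illustration, not the DeepSeek or llama.cpp implementation.

```python
# Toy sketch (not llama.cpp code) of a Multi-Token Prediction objective.
# All shapes, names, and the 2-token horizon are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN, HORIZON = 16, 8, 2   # tiny sizes for illustration

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# One shared hidden state per position, plus one output head per future offset.
hidden_states = rng.normal(size=(5, HIDDEN))          # 5 positions in a toy sequence
heads = rng.normal(size=(HORIZON, HIDDEN, VOCAB))     # head d predicts the token at t+1+d
targets = rng.integers(0, VOCAB, size=(5 + HORIZON,)) # toy "ground truth" tokens

# Standard next-token training uses only head 0; MTP adds losses for t+2, t+3, ...
mtp_loss = 0.0
for t in range(hidden_states.shape[0]):
    for d in range(HORIZON):
        probs = softmax(hidden_states[t] @ heads[d])
        mtp_loss += -np.log(probs[targets[t + 1 + d]])
mtp_loss /= hidden_states.shape[0] * HORIZON
print(f"toy MTP loss over a {HORIZON}-token horizon: {mtp_loss:.3f}")
```

The point of the sketch is only that each forward pass produces predictions for several future positions from the same hidden state, rather than a single next-token distribution.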
Practical Impact
MTP is a training-time technique, so not every model benefits immediately: only models trained with an MTP objective will see speedups at inference time. But as newer models increasingly incorporate MTP during training, llama.cpp users will be positioned to take advantage of those gains without any additional setup. Paired with parallel generation approaches like Orthrus, local LLM inference is accelerating rapidly.
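To see why the speedup depends on the model rather than the runtime, consider a toy draft-and-verify simulation, where MTP-style heads supply a short draft and the main model verifies it in a single pass. The draft length, acceptance probabilities, and function name below are made-up numbers and names for illustration, not llama.cpp's actual scheduler.

```python
# Hedged sketch of why MTP-trained models decode faster: each full-model pass
# can confirm several drafted tokens at once, so fewer passes are needed overall.
import random

random.seed(0)

def simulate_decode(total_tokens: int, draft_len: int, accept_prob: float) -> int:
    """Count full-model forward passes needed to emit `total_tokens` tokens when
    each pass verifies `draft_len` drafted tokens, each accepted independently
    with probability `accept_prob` until the first rejection."""
    emitted, passes = 0, 0
    while emitted < total_tokens:
        passes += 1
        accepted = 0
        for _ in range(draft_len):
            if random.random() < accept_prob:
                accepted += 1
            else:
                break                   # first rejection ends the accepted prefix
        emitted += accepted + 1         # the verify pass always yields at least 1 token
    return passes

baseline = 1000                         # plain autoregressive: one pass per token
for rate in (0.5, 0.8):
    passes = simulate_decode(1000, draft_len=2, accept_prob=rate)
    print(f"acceptance {rate:.0%}: ~{baseline / passes:.2f}x fewer full-model passes")
```

With these toy numbers, a higher acceptance rate translates directly into fewer full-model passes per generated token, which is why models trained with MTP see the gains while others do not.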
Why llama.cpp Matters
llama.cpp is the de facto standard for CPU and Apple Silicon LLM inference, used across Mac, Linux, and Windows environments by a massive community of local AI enthusiasts and developers. This merge demonstrates how quickly open-source AI infrastructure absorbs cutting-edge research techniques.
Related Articles
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Together with maturing tensor-parallel support, this is expected to close most of the token-generation speed gap between llama.cpp and vLLM.
A LocalLLaMA user shares their config for running Qwen3.6 35B A3B at over 80 tok/sec with 128K context on a 12GB VRAM GPU, using llama.cpp's Multi-Token Prediction support and achieving an 80%+ draft acceptance rate.
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking to active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.