Multi-Token Prediction Support Lands in llama.cpp
Original: MTP support merged into llama.cpp
MTP Is Now in llama.cpp
PR #22673 has been merged into the llama.cpp master branch, bringing official Multi-Token Prediction (MTP) support to the most widely used local LLM inference engine. The news earned 300+ upvotes on r/LocalLLaMA as the community celebrated the milestone.
What Is MTP?
Standard autoregressive language models generate tokens one at a time in sequence. Multi-Token Prediction trains models to predict multiple future tokens in a single forward pass. At inference time, those extra prediction heads can serve as a built-in draft for speculative decoding, which is where the speedup comes from. DeepSeek-V3 and DeepSeek-R1 used MTP to achieve significant inference speed improvements, attracting considerable attention from the AI community.
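As a rough illustration of the training objective, here is a minimal sketch in which one shared hidden state feeds several output heads, each predicting the token a fixed number of steps ahead. The sizes, head layout, and two-token horizon are toy assumptions for illustration, not the DeepSeek or llama.cpp implementation.

```python
# Toy sketch (not llama.cpp code) of a Multi-Token Prediction objective.
# All shapes, names, and the 2-token horizon are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN, HORIZON = 16, 8, 2   # tiny sizes for illustration

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# One shared hidden state per position, plus one output head per future offset.
hidden_states = rng.normal(size=(5, HIDDEN))          # 5 positions in a toy sequence
heads = rng.normal(size=(HORIZON, HIDDEN, VOCAB))     # head d predicts the token at t+1+d
targets = rng.integers(0, VOCAB, size=(5 + HORIZON,)) # toy "ground truth" tokens

# Standard next-token training uses only head 0; MTP adds losses for t+2, t+3, ...
mtp_loss = 0.0
for t in range(hidden_states.shape[0]):
    for d in range(HORIZON):
        probs = softmax(hidden_states[t] @ heads[d])
        mtp_loss += -np.log(probs[targets[t + 1 + d]])
mtp_loss /= hidden_states.shape[0] * HORIZON
print(f"toy MTP loss over a {HORIZON}-token horizon: {mtp_loss:.3f}")
```

The point of the sketch is only that each forward pass produces predictions for several future positions from the same hidden state, rather than a single next-token distribution.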
Practical Impact
MTP is a training-time technique, so not every model benefits immediately: only models trained with an MTP objective will see speedups at inference time. But as newer models increasingly incorporate MTP during training, llama.cpp users will be positioned to take advantage of those gains without any additional setup. Paired with parallel generation approaches like Orthrus, local LLM inference is accelerating rapidly.
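To see why the speedup depends on the model rather than the runtime, consider a toy draft-and-verify simulation, where MTP-style heads supply a short draft and the main model verifies it in a single pass. The draft length, acceptance probabilities, and function name below are made-up numbers and names for illustration, not llama.cpp's actual scheduler.

```python
# Hedged sketch of why MTP-trained models decode faster: each full-model pass
# can confirm several drafted tokens at once, so fewer passes are needed overall.
import random

random.seed(0)

def simulate_decode(total_tokens: int, draft_len: int, accept_prob: float) -> int:
    """Count full-model forward passes needed to emit `total_tokens` tokens when
    each pass verifies `draft_len` drafted tokens, each accepted independently
    with probability `accept_prob` until the first rejection."""
    emitted, passes = 0, 0
    while emitted < total_tokens:
        passes += 1
        accepted = 0
        for _ in range(draft_len):
            if random.random() < accept_prob:
                accepted += 1
            else:
                break                   # first rejection ends the accepted prefix
        emitted += accepted + 1         # the verify pass always yields at least 1 token
    return passes

baseline = 1000                         # plain autoregressive: one pass per token
for rate in (0.5, 0.8):
    passes = simulate_decode(1000, draft_len=2, accept_prob=rate)
    print(f"acceptance {rate:.0%}: ~{baseline / passes:.2f}x fewer full-model passes")
```

With these toy numbers, a higher acceptance rate translates directly into fewer full-model passes per generated token, which is why models trained with MTP see the gains while others do not.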
Why llama.cpp Matters
llama.cpp is the de facto standard for CPU and Apple Silicon LLM inference, used across Mac, Linux, and Windows environments by a massive community of local AI enthusiasts and developers. This merge demonstrates how quickly open-source AI infrastructure absorbs cutting-edge research techniques.
Related Articles
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Together with maturing tensor-parallel support, this is expected to close most of the token-generation speed gap between llama.cpp and vLLM.
A LocalLLaMA user shares their config for running Qwen3.6 35B A3B at over 80 tok/sec with 128K context on a 12GB VRAM GPU, using llama.cpp's Multi-Token Prediction support and achieving an 80%+ draft acceptance rate.
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking to active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.