Llama.cpp Multi-Token Prediction Support Enters Beta, Closing the vLLM Performance Gap
Original: Llama.cpp MTP support now in beta!
What Is MTP
Multi-Token Prediction (MTP) lets a model predict several tokens per inference step instead of one, which raises generation throughput whenever the extra predictions turn out to be usable. Server-side inference frameworks like vLLM already support MTP, which has given them a speed edge over llama.cpp in high-throughput scenarios. That gap is now starting to close.
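To make the throughput claim concrete, here is a minimal sketch of the arithmetic. The post does not describe llama.cpp's internals, so this assumes the draft-and-verify scheme commonly used with MTP heads: extra heads draft a few tokens, and the main model keeps the longest matching prefix and always contributes one token of its own. All function names are hypothetical, and the independent-acceptance model is a simplification.

```python
def accepted_prefix_len(draft_tokens, verified_tokens):
    """Count leading draft tokens that match what the main model would emit (greedy check)."""
    count = 0
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break
        count += 1
    return count


def tokens_emitted_per_step(draft_tokens, verified_tokens):
    """Accepted draft tokens plus the one token the main model always contributes."""
    return accepted_prefix_len(draft_tokens, verified_tokens) + 1


def expected_tokens_per_step(accept_prob, num_draft):
    """Expected tokens per step, ASSUMING each draft token is accepted
    independently with probability accept_prob (a simplification)."""
    # Geometric sum: 1 + p + p^2 + ... + p^k
    return sum(accept_prob ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    # Two of three drafted tokens match, so the step emits 2 + 1 tokens.
    print(tokens_emitted_per_step([11, 42, 7], [11, 42, 99]))
    # Hypothetical 80% acceptance rate with 2 draft tokens.
    print(round(expected_tokens_per_step(0.8, 2), 2))
```

Under these assumptions, an 80% acceptance rate with two draft tokens gives about 2.44 tokens per step. Real-world gains are smaller because drafting and verification add their own cost, and acceptance rates vary by model and prompt.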
Beta Status
A post scoring 277 on r/LocalLLaMA announced that llama.cpp's MTP implementation has entered beta, thanks to contributor Aman and the broader community. Current support is limited to Qwen3.5 MTP, with other model families expected to follow.
The developer noted: "Between this and the maturing tensor-parallel support, expect most performance gaps between llama.cpp and vLLM, at least when it comes to token generation speeds, to be erased."
Impact for Local Inference
llama.cpp is the de facto standard for running LLMs on consumer hardware. Once MTP stabilizes, local token generation speeds for models like Qwen3 and Llama 4 should move closer to what server-side frameworks such as vLLM deliver, removing one of the last meaningful advantages of cloud-hosted inference for many workloads. The pull request is in review and expected to merge to main shortly.