Google has released open-weight MTP drafter models for Gemma 4 31B and 26B-A4B, enabling speculative decoding to significantly boost inference speed without affecting output quality.
#mtp
LLM Hacker News 2h ago 1 min read
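Why drafter-based speculative decoding leaves outputs unchanged: the target model verifies every drafted token and only keeps the ones it would have generated itself. Below is a minimal greedy-decoding sketch with toy stand-in models; it is purely illustrative and not the Gemma drafter or any real inference stack.

```python
# Toy, deterministic stand-ins for the two models -- assumptions for
# illustration only, not the Gemma 4 / MTP drafter implementation.

def target_next(tokens):
    """'Large' target model: deterministic next-token rule."""
    return (sum(tokens) * 31 + 7) % 101

def draft_next(tokens):
    """Cheap drafter: agrees with the target most of the time."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 101  # occasional miss

def greedy_decode(prompt, n_new):
    """Baseline: the target model decoding alone."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=3):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Drafter proposes k tokens cheaply.
        ctx, draft = list(tokens), []
        for _ in range(k):
            draft.append(draft_next(ctx))
            ctx.append(draft[-1])
        # 2) Target verifies the draft: keep the longest matching prefix,
        #    and emit the target's own token at the first mismatch.
        for t in draft:
            expected = target_next(tokens)
            if t == expected:
                tokens.append(t)            # draft token accepted
            else:
                tokens.append(expected)     # corrected by the target
                break
        else:
            tokens.append(target_next(tokens))  # bonus token after a full accept
    return tokens[: len(prompt) + n_new]

prompt = [5, 12, 9]
assert speculative_decode(prompt, 30) == greedy_decode(prompt, 30)
print("speculative output == target-only greedy output")
```

The drafter only changes how many target forward passes are needed, not which tokens come out; production implementations extend the same idea to sampling via rejection sampling.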
LLM Reddit 1d ago 1 min read
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Together with maturing tensor-parallel support, this is expected to close most of the token-generation speed gap between llama.cpp and vLLM.
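For context on the tensor-parallel half of that claim, here is a minimal NumPy sketch of column-parallel weight sharding, the basic building block of tensor parallelism; it is a conceptual illustration, not llama.cpp's or vLLM's implementation.

```python
import numpy as np

# Column-parallel sharding: each "device" holds a vertical slice of the
# weight matrix and computes its slice of the output independently.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))        # activations: (batch, hidden)
W = rng.standard_normal((512, 2048))     # full weight: (hidden, ffn)

n_devices = 4
shards = np.split(W, n_devices, axis=1)  # 2048 / 4 = 512 columns per device
partials = [x @ w for w in shards]       # runs in parallel on real hardware
y_tp = np.concatenate(partials, axis=1)  # all-gather of the output slices

assert np.allclose(y_tp, x @ W)          # matches the unsharded result
```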
LLM Reddit Mar 21, 2026 3 min read
A Reddit thread in r/LocalLLaMA spotlighted mlx-lm PR #990, which uses Qwen3.5's built-in MTP head for native speculative decoding and reports 15.3 -> 23.3 tok/s (a ~1.5x throughput gain) at an ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. The gain is meaningful, but so are the constraints around converted checkpoints, disabled batching, and untested MoE variants.
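A quick back-of-the-envelope check on those numbers (assuming a single-token MTP draft per step, which the thread does not state explicitly):

```python
# Reported figures from the thread; the 1-token-draft assumption below is
# ours, not the PR's.
baseline_tps    = 15.3   # tok/s, plain decoding
speculative_tps = 23.3   # tok/s, MTP speculative decoding
acceptance      = 0.806  # reported acceptance rate

observed_speedup = speculative_tps / baseline_tps
print(f"observed speedup: {observed_speedup:.2f}x")        # ~1.52x

# With a 1-token draft, each target pass yields the drafted token (when
# accepted) plus one token of its own, so the ceiling -- ignoring the cost
# of the MTP head and assuming a 2-position verify pass costs about the
# same as a 1-token decode step -- is 1 + acceptance.
ceiling = 1 + acceptance
print(f"ceiling at 80.6% acceptance: {ceiling:.2f}x")      # ~1.81x

# The gap between ~1.81x and ~1.52x is drafting/verification overhead.
print(f"overhead eats ~{1 - observed_speedup / ceiling:.0%} of the ideal gain")
```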