Running Qwen3.6 35B A3B at 80+ tok/sec on 12GB VRAM With llama.cpp MTP
The Achievement
A post on r/LocalLLaMA has attracted significant attention for pushing Qwen3.6 35B A3B past 80 tokens per second with a 128K context window on a consumer GPU with 12GB of VRAM. Running a 35-billion-parameter model at this speed on entry-level hardware would have been impractical just months ago.
The Key: llama.cpp MTP
The breakthrough comes from Multi-Token Prediction (MTP) support added in a recent llama.cpp pull request. MTP uses a draft model to predict several tokens ahead, then has the main model verify them in a single batched forward pass; the post reports a draft acceptance rate above 80%. Because rejected draft tokens are simply replaced by the main model's own choices, this dramatically improves effective throughput without degrading output quality.
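To make the draft-and-verify idea concrete, here is a minimal Python sketch of a greedy speculative step of this kind. It is not llama.cpp's actual implementation; `draft_next` and `main_predict` are hypothetical stand-ins for the draft predictor and the batched main-model pass.

```python
def generate_step(context, draft_next, main_predict, k=4):
    """One draft-and-verify step: draft k tokens cheaply, verify them in one batched pass.

    draft_next(tokens)          -> next token from the cheap draft predictor
    main_predict(ctx, drafted)  -> main model's greedy choice at each drafted position,
                                   computed in a single batched forward pass
    Both callables are hypothetical stand-ins, not real llama.cpp APIs.
    """
    # Draft phase: propose k tokens autoregressively with the cheap predictor.
    drafted, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # Verify phase: one batched main-model pass scores all k positions at once.
    main_choice = main_predict(context, drafted)

    # Accept the longest prefix on which the main model agrees with the draft;
    # at the first disagreement, keep the main model's token instead.
    accepted = []
    for d, m in zip(drafted, main_choice):
        accepted.append(m)
        if d != m:
            break
    return accepted


# Toy usage with stand-in predictors (integers as "tokens"):
next_plus_one = lambda toks: toks[-1] + 1                          # draft guesses "+1"
main = lambda ctx, dr: [ctx[-1] + i + 1 for i in range(len(dr))]   # main model agrees here
print(generate_step([1, 2, 3], next_plus_one, main))               # -> [4, 5, 6, 7]
```

The key design point is that the main model's single batched verification pass costs roughly as much as generating one token, so every extra accepted draft token is nearly free, and the accept-or-replace rule guarantees the final text matches what the main model alone would have produced under greedy decoding.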
Configuration Highlights
- Model: Qwen3.6 35B A3B (quantized)
- Context: 128K tokens
- Speed: 80+ tokens/sec
- Draft acceptance rate: 80%+ (see the rough speedup arithmetic after this list)
- Required VRAM: 12GB
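As a rough illustration of why an 80%+ acceptance rate translates into such a large speedup, the standard speculative-decoding estimate for tokens committed per main-model pass is (1 - a^(k+1)) / (1 - a), where k is the draft length and a the per-token acceptance probability. The post does not state the draft length, so k = 4 below is an assumption, and a = 0.8 stands in for the reported "80%+".

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per main-model forward pass when k tokens are
    drafted and each is accepted independently with probability alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Assumed draft length k = 4 and the reported ~0.8 acceptance rate:
print(round(expected_tokens_per_pass(0.8, 4), 2))  # -> 3.36 tokens per pass
```

Under those assumptions each main-model pass commits roughly 3.4 tokens instead of 1, so, drafting overhead aside, an MTP-enabled run can plausibly be several times faster than plain token-by-token decoding.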
Why It Matters
12GB VRAM corresponds to consumer GPUs in the RTX 3060-4070 class. Reaching practical speeds with a 35B model on this tier of hardware marks meaningful progress for local AI democratization.
Related Articles
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Combined with maturing tensor-parallel support, it is expected to close most of the token-generation speed gap between llama.cpp and vLLM.
A LocalLLaMA user has shared a detailed guide for running Qwen 3.6 27B with Multi-Token Prediction support in llama.cpp, achieving a 2.5x inference speedup and a 262K context on 48GB of memory.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.