Running Qwen3.6 35B A3B at 80+ tok/sec on 12GB VRAM with llama.cpp MTP


May 10, 2026 · By Insights AI (Reddit)

The Achievement

A post on r/LocalLLaMA has attracted significant attention for achieving over 80 tokens per second and 128K context with Qwen3.6 35B A3B on a 12GB VRAM consumer GPU. The "A3B" suffix matters here: in Qwen's naming it denotes a Mixture-of-Experts model that activates only about 3 billion of its 35 billion parameters per token, which is what makes speeds like this plausible at all. Even so, running a 35-billion parameter model this fast on entry-level hardware would have been impractical just months ago.

The Key: llama.cpp MTP

The breakthrough comes from Multi-Token Prediction (MTP) support in a recent llama.cpp pull request. MTP is a form of speculative decoding, but rather than relying on a separate draft model it uses lightweight prediction heads built into the model itself to draft several tokens ahead; the main model then verifies the drafted tokens in a single batched forward pass. In this setup the draft reaches an 80%+ acceptance rate, and because rejected tokens are simply discarded and regenerated, throughput improves dramatically without degrading output quality.
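
To see why an 80% acceptance rate is such a big deal, the standard speculative-decoding estimate is useful. Assuming, as a simplification, that each drafted token is accepted independently with probability α and that k tokens are drafted per step (the post states neither the exact acceptance distribution nor k, so the numbers below are illustrative), the expected number of tokens produced per full-model verification pass is:

    E[tokens per pass] = (1 - α^(k+1)) / (1 - α)

With α = 0.8 and k = 4, this works out to (1 - 0.8^5) / 0.2 ≈ 3.4 tokens per expensive forward pass, versus exactly 1 without drafting, i.e. over a 3x cut in the number of full-model passes.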

Configuration Highlights

  • Model: Qwen3.6 35B A3B (quantized)
  • Context: 128K tokens
  • Speed: 80+ tokens/sec
  • Draft acceptance rate: 80%+
  • Required VRAM: 12GB
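
For reference, a launch command along these lines is sketched below. This is a plausible reconstruction rather than the poster's exact configuration: the GGUF filename is hypothetical, the expert-offload pattern is a common llama.cpp trick for squeezing large MoE models into small VRAM rather than a confirmed detail of this post, and the flag that actually enables MTP is defined by the pull request in question, so it is not shown.

    # Assumptions: hypothetical quantized GGUF filename; -ngl 99 offloads all
    # layers to the 12GB GPU while -ot keeps the MoE expert tensors in system
    # RAM (a common fit trick, not confirmed for this post); q8_0 KV-cache
    # quantization shrinks the 128K context enough to fit alongside the weights.
    llama-server \
      -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
      -c 131072 \
      -ngl 99 \
      -ot ".ffn_.*_exps.=CPU" \
      --cache-type-k q8_0 \
      --cache-type-v q8_0

Note that a 35B model's weights do not fit in 12GB even at 4-bit quantization (roughly 19-20GB), so some split between GPU and system RAM like the one above is required; with only ~3B parameters active per token, the CPU-resident expert computation stays manageable.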

Why It Matters

12GB of VRAM corresponds to consumer GPUs in the RTX 3060-4070 class. Reaching practical speeds with a 35B model on this tier of hardware marks meaningful progress in the democratization of local AI.
