LocalLLaMA Benchmarks Gemma 4 Speculative Decoding at a 29% Average Speedup

Original: Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

LLM · Apr 12, 2026 · By Insights AI (Reddit) · 2 min read

A new r/LocalLLaMA benchmark suggests speculative decoding is already a practical speed win for local Gemma 4 deployments, provided the model files and runtime flags line up. The post tested Gemma 4 31B as the target model with Gemma 4 E2B as a draft model and reported an average throughput increase from 57.17 tokens per second to 73.73 tokens per second, or about 29%, on an RTX 5090 under Windows 11.
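As a rough sketch of the kind of invocation the post describes (the model filenames below are assumptions, not taken from the post; the flags are llama.cpp's standard speculative-decoding options):

```shell
# Sketch only: filenames are hypothetical placeholders.
# -md loads the draft model; --draft-max and --parallel match the
# settings the post reports as best for mixed workloads.
./llama-server \
    -m  gemma-4-31b-q4_k_m.gguf \
    -md gemma-4-e2b-q4_k_m.gguf \
    --draft-max 8 \
    --parallel 1
```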

The most interesting part was not the raw number but the failure mode that preceded it. The author reports that early tests were dramatically slower because llama.cpp judged the target and draft vocabularies incompatible. After inspecting speculative.cpp, they traced the problem to a mismatch in add_bos_token metadata between an early-April Gemma 4 31B GGUF and a later E2B download. The mismatch forced llama.cpp into token translation mode, erased the expected gain, and pushed throughput down to about 7.31 tokens per second in the broken configuration.
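A mismatch like this can be checked before benchmarking. As a sketch, assuming the `gguf` Python package's `gguf-dump` tool and hypothetical filenames, the relevant key is `tokenizer.ggml.add_bos_token` and it should match across both files:

```shell
# Hypothetical filenames; gguf-dump comes from `pip install gguf`.
# Both models should report the same add_bos_token value.
gguf-dump --no-tensors gemma-4-31b.gguf | grep -i add_bos_token
gguf-dump --no-tensors gemma-4-e2b.gguf | grep -i add_bos_token
```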

Once the 31B GGUF was re-downloaded with corrected tokenizer metadata, the speedups became far more compelling. Code generation and math prompts were roughly 50% faster, science explanations gained about 24%, and even low-predictability tasks such as translation still stayed modestly positive. The post also reports that --draft-max 8 was the best overall setting for mixed workloads, while --parallel 1 was effectively mandatory because higher automatic parallelism multiplied KV-cache pressure and wasted VRAM.
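The headline figure follows directly from the reported throughputs; a quick sanity check of the arithmetic (numbers taken from the post):

```shell
# Speedup implied by the post's reported throughputs (tokens/s).
awk 'BEGIN {
  base = 57.17; spec = 73.73   # no-draft vs. speculative decoding
  printf "speedup: %.1f%%\n", (spec / base - 1) * 100
}'
# prints: speedup: 29.0%
```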

For local inference users, the thread is a good reminder that performance tuning is no longer just about picking a smaller draft model. GGUF metadata, tokenizer compatibility, context length, KV cache behavior, and modality choices all matter. The author estimated only about 2.3GB of extra VRAM beyond the main model for the Q4 draft setup, which makes the technique far more accessible than many people assume. In short, speculative decoding looks mature enough to be a real default experiment for Gemma 4 users, but only if they verify that their model artifacts are aligned before benchmarking.

Source: r/LocalLLaMA benchmark post.




© 2026 Insights. All rights reserved.