LocalLLaMA Benchmarks Gemma 4 Speculative Decoding at a 29% Average Speedup
Original: Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)
A new r/LocalLLaMA benchmark suggests speculative decoding is already a practical speed win for local Gemma 4 deployments, provided the model files and runtime flags line up. The post tested Gemma 4 31B as the target model with Gemma 4 E2B as a draft model and reported an average throughput increase from 57.17 tokens per second to 73.73 tokens per second, or about 29%, on an RTX 5090 under Windows 11.
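As a concrete sketch, a pairing like the one benchmarked can be launched with llama.cpp's `llama-server`, which accepts a draft model via `-md`. The filenames below are placeholders and the flag spellings reflect recent llama.cpp builds, so verify against `llama-server --help` on your version:

```shell
# Target: Gemma 4 31B GGUF; draft: Gemma 4 E2B GGUF (placeholder filenames).
# --draft-max 8 and --parallel 1 mirror the settings the post found best.
llama-server \
  -m gemma-4-31b-Q4_K_M.gguf \
  -md gemma-4-e2b-Q4_K_M.gguf \
  --draft-max 8 \
  --parallel 1 \
  -ngl 99 -ngld 99   # offload both target and draft layers to the GPU
```
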
The most interesting part was not the raw number but the failure mode that preceded it. The author reports that early tests were dramatically slower because llama.cpp treated the target and draft vocabularies as incompatible. After inspecting speculative.cpp, they traced the problem to a mismatch in add_bos_token metadata between an early-April Gemma 4 31B GGUF and a later E2B download. That mismatch forced llama.cpp into token translation mode, erased the expected gain, and dropped throughput to about 7.31 tokens per second in the broken configuration.
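Since the fix amounted to re-downloading a file with corrected metadata, it is worth diffing the tokenizer fields of the two GGUFs before benchmarking. Below is a minimal, self-contained sketch of that check; the key names follow common GGUF tokenizer metadata conventions, and the sample values are hypothetical stand-ins for the reported `add_bos_token` mismatch, not the author's actual files:

```python
# Sketch: flag tokenizer-metadata mismatches of the kind that force
# llama.cpp into token translation mode between target and draft.

CRITICAL_KEYS = [
    "tokenizer.ggml.add_bos_token",
    "tokenizer.ggml.bos_token_id",
    "tokenizer.ggml.eos_token_id",
    "tokenizer.ggml.model",
]

def tokenizer_mismatches(target_meta: dict, draft_meta: dict) -> dict:
    """Return {key: (target_value, draft_value)} for keys that differ."""
    return {
        key: (target_meta.get(key), draft_meta.get(key))
        for key in CRITICAL_KEYS
        if target_meta.get(key) != draft_meta.get(key)
    }

# Hypothetical metadata resembling the reported failure: the early 31B
# GGUF disagreed with the E2B draft on add_bos_token.
target = {"tokenizer.ggml.add_bos_token": False, "tokenizer.ggml.model": "llama"}
draft = {"tokenizer.ggml.add_bos_token": True, "tokenizer.ggml.model": "llama"}

print(tokenizer_mismatches(target, draft))
# → {'tokenizer.ggml.add_bos_token': (False, True)}
```

In practice the two dicts would come from a GGUF metadata reader rather than being written by hand; the point is that any disagreement on these keys is a red flag before running a benchmark.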
Once the 31B GGUF was re-downloaded with corrected tokenizer metadata, the speedups became far more compelling. Code generation and math prompts were roughly 50% faster, science explanations gained about 24%, and even low-predictability tasks such as translation still stayed modestly positive. The post also reports that --draft-max 8 was the best overall setting for mixed workloads, while --parallel 1 was effectively mandatory because higher automatic parallelism multiplied KV-cache pressure and wasted VRAM.
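The spread between code (~50%) and translation (modestly positive) is consistent with the standard expected-gain model for speculative decoding: if each drafted token is accepted independently with probability a, a draft of length k commits (1 - a^(k+1)) / (1 - a) tokens per target forward pass on average, so predictable text pays off far more. A quick sketch (the acceptance rates here are illustrative, not measured from the post):

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens committed per target-model forward pass, assuming
    each drafted token is accepted independently with probability
    accept_rate (the textbook speculative-decoding expectation)."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

# Higher acceptance (predictable text like code) yields far more tokens
# per expensive target pass than low acceptance (e.g. translation).
for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_step(rate, 8), 2))
```

This also hints at why --draft-max 8 is a reasonable middle ground: longer drafts only help when acceptance is high, and they cost draft-model compute and KV cache either way.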
For local inference users, the thread is a good reminder that performance tuning is no longer just about picking a smaller draft model. GGUF metadata, tokenizer compatibility, context length, KV cache behavior, and modality choices all matter. The author estimated only about 2.3GB of extra VRAM beyond the main model for the Q4 draft setup, which makes the technique far more accessible than many people assume. In short, speculative decoding looks mature enough to be a real default experiment for Gemma 4 users, but only if they verify that their model artifacts are aligned before benchmarking.
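The ~2.3GB figure is roughly what a weight-only estimate predicts for a small Q4 draft model. A back-of-envelope sketch, where the parameter count and effective bits-per-weight are assumptions and KV cache plus runtime buffers are ignored:

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM footprint; ignores KV cache and buffers."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

# Hypothetical ~4B-parameter draft at ~4.5 effective bits per weight
# (Q4-family quants carry some per-block overhead above 4.0 bits).
est = quantized_weight_gb(4.0, 4.5)
print(f"{est:.2f} GB")  # ≈ 2.1 GB, the same ballpark as the reported ~2.3GB
```
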
Source: r/LocalLLaMA benchmark post.
Related Articles
A LocalLLaMA user compared Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B across 30 blind prompts judged by Claude Opus 4.6. The result is not one clear winner but a more useful trade-off story around reliability, verbosity, and category-specific strengths.
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.
A high-scoring LocalLLaMA post argued that merging llama.cpp PR #21534 finally cleared the known Gemma 4 issues in current master. The community focus was not just the fix itself, but the operational details around tokenizer correctness, chat templates, memory flags, and the warning to avoid CUDA 13.2.