LocalLLaMA Benchmarks Gemma 4 Speculative Decoding at a 29% Average Speedup
Original: Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)
A new r/LocalLLaMA benchmark suggests speculative decoding is already a practical speed win for local Gemma 4 deployments, provided the model files and runtime flags line up. The post tested Gemma 4 31B as the target model with Gemma 4 E2B as a draft model and reported an average throughput increase from 57.17 tokens per second to 73.73 tokens per second, or about 29%, on an RTX 5090 under Windows 11.
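As a concrete sketch, a pairing like the one benchmarked can be launched with llama.cpp's `llama-server`, which accepts a draft model via `-md`. The filenames below are placeholders and the flag spellings reflect recent llama.cpp builds, so verify against `llama-server --help` on your version:

```shell
# Target: Gemma 4 31B GGUF; draft: Gemma 4 E2B GGUF (placeholder filenames).
# --draft-max 8 and --parallel 1 mirror the settings the post found best.
llama-server \
  -m gemma-4-31b-Q4_K_M.gguf \
  -md gemma-4-e2b-Q4_K_M.gguf \
  --draft-max 8 \
  --parallel 1 \
  -ngl 99 -ngld 99   # offload both target and draft layers to the GPU
```
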
The most interesting part was not the raw number but the failure mode that preceded it. The author reports that early tests were dramatically slower because llama.cpp treated the target and draft vocabularies as incompatible. After inspecting speculative.cpp, they traced the problem to a mismatch in add_bos_token metadata between an early-April Gemma 4 31B GGUF and a later E2B download. That mismatch forced llama.cpp into token translation mode, erased the expected gain, and dropped throughput to about 7.31 tokens per second in the broken configuration.
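Since the fix amounted to re-downloading a file with corrected metadata, it is worth diffing the tokenizer fields of the two GGUFs before benchmarking. Below is a minimal, self-contained sketch of that check; the key names follow common GGUF tokenizer metadata conventions, and the sample values are hypothetical stand-ins for the reported `add_bos_token` mismatch, not the author's actual files:

```python
# Sketch: flag tokenizer-metadata mismatches of the kind that force
# llama.cpp into token translation mode between target and draft.

CRITICAL_KEYS = [
    "tokenizer.ggml.add_bos_token",
    "tokenizer.ggml.bos_token_id",
    "tokenizer.ggml.eos_token_id",
    "tokenizer.ggml.model",
]

def tokenizer_mismatches(target_meta: dict, draft_meta: dict) -> dict:
    """Return {key: (target_value, draft_value)} for keys that differ."""
    return {
        key: (target_meta.get(key), draft_meta.get(key))
        for key in CRITICAL_KEYS
        if target_meta.get(key) != draft_meta.get(key)
    }

# Hypothetical metadata resembling the reported failure: the early 31B
# GGUF disagreed with the E2B draft on add_bos_token.
target = {"tokenizer.ggml.add_bos_token": False, "tokenizer.ggml.model": "llama"}
draft = {"tokenizer.ggml.add_bos_token": True, "tokenizer.ggml.model": "llama"}

print(tokenizer_mismatches(target, draft))
# → {'tokenizer.ggml.add_bos_token': (False, True)}
```

In practice the two dicts would come from a GGUF metadata reader rather than being written by hand; the point is that any disagreement on these keys is a red flag before running a benchmark.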
Once the 31B GGUF was re-downloaded with corrected tokenizer metadata, the speedups became far more compelling. Code generation and math prompts were roughly 50% faster, science explanations gained about 24%, and even low-predictability tasks such as translation still stayed modestly positive. The post also reports that --draft-max 8 was the best overall setting for mixed workloads, while --parallel 1 was effectively mandatory because higher automatic parallelism multiplied KV-cache pressure and wasted VRAM.
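The spread between code (~50%) and translation (modestly positive) is consistent with the standard expected-gain model for speculative decoding: if each drafted token is accepted independently with probability a, a draft of length k commits (1 - a^(k+1)) / (1 - a) tokens per target forward pass on average, so predictable text pays off far more. A quick sketch (the acceptance rates here are illustrative, not measured from the post):

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens committed per target-model forward pass, assuming
    each drafted token is accepted independently with probability
    accept_rate (the textbook speculative-decoding expectation)."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

# Higher acceptance (predictable text like code) yields far more tokens
# per expensive target pass than low acceptance (e.g. translation).
for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_step(rate, 8), 2))
```

This also hints at why --draft-max 8 is a reasonable middle ground: longer drafts only help when acceptance is high, and they cost draft-model compute and KV cache either way.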
For local inference users, the thread is a good reminder that performance tuning is no longer just about picking a smaller draft model. GGUF metadata, tokenizer compatibility, context length, KV cache behavior, and modality choices all matter. The author estimated only about 2.3GB of extra VRAM beyond the main model for the Q4 draft setup, which makes the technique far more accessible than many people assume. In short, speculative decoding looks mature enough to be a real default experiment for Gemma 4 users, but only if they verify that their model artifacts are aligned before benchmarking.
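The ~2.3GB figure is roughly what a weight-only estimate predicts for a small Q4 draft model. A back-of-envelope sketch, where the parameter count and effective bits-per-weight are assumptions and KV cache plus runtime buffers are ignored:

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM footprint; ignores KV cache and buffers."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

# Hypothetical ~4B-parameter draft at ~4.5 effective bits per weight
# (Q4-family quants carry some per-block overhead above 4.0 bits).
est = quantized_weight_gb(4.0, 4.5)
print(f"{est:.2f} GB")  # ≈ 2.1 GB, the same ballpark as the reported ~2.3GB
```
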
Source: r/LocalLLaMA benchmark post.
Related Articles
A LocalLLaMA user compared Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B across 30 blind prompts judged by Claude Opus 4.6. The result is not one clear winner but a more useful trade-off story around reliability, verbosity, and category-specific strengths.
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.
A high-scoring LocalLLaMA post argued that merging llama.cpp PR #21534 finally cleared the known Gemma 4 issues in current master. The community focus was not just the fix itself, but the operational details around tokenizer correctness, chat templates, memory flags, and the warning to avoid CUDA 13.2.