LocalLLaMA Benchmark Claims Gemma 4 Speculative Decoding Gains of 29% on Average
Original: Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)
What the LocalLLaMA post measured
An r/LocalLLaMA post shared a controlled speculative decoding benchmark for Gemma 4 31B using Gemma 4 E2B as the draft model. The thread snapshot used for this crawl showed 257 upvotes and 88 comments. The test environment was specific: an RTX 5090 with 32GB VRAM, Windows 11, a llama.cpp fork with TurboQuant KV cache, 128K context, Flash Attention, and --draft-max 8 --draft-min 1.
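For readers wanting to reproduce a comparable setup, the post's settings map onto standard llama.cpp server flags. The sketch below is an assumption-laden reconstruction: the model filenames are placeholders, and the fork's TurboQuant KV-cache option is omitted because the post does not name its flag.

```python
# Sketch of a llama-server invocation matching the post's settings.
# Filenames are placeholders; fork-specific options are omitted.
def build_command(target_gguf: str, draft_gguf: str) -> list[str]:
    return [
        "llama-server",
        "-m", target_gguf,            # Gemma 4 31B target model
        "--model-draft", draft_gguf,  # Gemma 4 E2B draft model
        "--draft-max", "8",           # longest draft per verification step
        "--draft-min", "1",           # shortest draft worth verifying
        "-c", str(128 * 1024),        # 128K context
        "--flash-attn",               # Flash Attention
        "--parallel", "1",            # single slot (the post's recommendation)
    ]

cmd = build_command("gemma-4-31b-q4_k_m.gguf", "gemma-4-e2b-q4_k_m.gguf")
```

The quantization suffixes in the filenames are illustrative, not from the post.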
Observed throughput gains
The poster reported a baseline average of 57.17 t/s and a speculative decoding average of 73.73 t/s, which is a +29.0% improvement. The biggest gains were on structured outputs: math rose from 57.45 to 85.86 t/s, and code generation from 57.15 to 86.05 t/s, both about +50%. Semi-structured explanations gained about +24%, while translation and analysis still improved +10.7% despite a lower 42.2% acceptance rate.
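The reported percentages check out against the raw throughput numbers. A quick recomputation:

```python
# Recompute the reported percentage gains from the raw t/s figures.
def pct_gain(baseline: float, speculative: float) -> float:
    return (speculative / baseline - 1.0) * 100.0

overall = pct_gain(57.17, 73.73)    # ~+29.0% average
math_gain = pct_gain(57.45, 85.86)  # ~+49.5% on math
code_gain = pct_gain(57.15, 86.05)  # ~+50.6% on code generation
```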
The compatibility trap
The most operationally useful detail was not the headline benchmark but the failure case. Before re-downloading the model, the author saw the draft path fall to 7.31 t/s, with llama.cpp warning that the target and draft vocabularies were not compatible. The post attributes the slowdown to a metadata mismatch in add_bos_token: an early April Gemma 4 31B GGUF had it set to false, while the later E2B draft had it set to true, which forced token translation and erased the expected speedup.
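That kind of mismatch can be caught before launch by diffing tokenizer metadata between the two GGUFs. The sketch below is an illustrative stand-in for llama.cpp's own compatibility warning, not its actual logic; the key names follow standard GGUF metadata conventions.

```python
# Illustrative pre-flight check: compare tokenizer metadata between the
# target and draft GGUFs before enabling speculative decoding.
KEYS = ("tokenizer.ggml.model", "tokenizer.ggml.add_bos_token")

def vocab_mismatches(target_meta: dict, draft_meta: dict) -> list[str]:
    """Return the metadata keys on which the two models disagree."""
    return [k for k in KEYS if target_meta.get(k) != draft_meta.get(k)]

# The failure case from the post: add_bos_token false vs true.
target = {"tokenizer.ggml.model": "llama", "tokenizer.ggml.add_bos_token": False}
draft = {"tokenizer.ggml.model": "llama", "tokenizer.ggml.add_bos_token": True}
mismatches = vocab_mismatches(target, draft)
```

In practice the real metadata can be dumped with llama.cpp's own GGUF tooling; the dicts above just model the mismatch described in the post.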
Configuration notes that matter in practice
The post argues that --parallel 1 is mandatory in this setup because automatic parallelism multiplies the draft KV allocation and can consume enough VRAM to destroy performance. It also claims a Q4 draft is sufficient, that speculative decoding cannot be combined with multimodal vision in this stack, and that the additional VRAM cost is about 2.3GB, bringing the total to roughly 23.4GB at 128K context and 25.5GB at 256K.
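The post's figures support a simple budget check. The arithmetic below uses only the numbers reported in the thread (2.3GB draft overhead, 23.4GB at 128K, 25.5GB at 256K, a 32GB card); the per-slot multiplication is a simplified model of why the post treats --parallel 1 as mandatory, not a measurement.

```python
# Back-of-envelope VRAM check using the post's reported figures.
DRAFT_OVERHEAD_GB = 2.3   # extra cost of the E2B draft (from the post)
TOTAL_128K_GB = 23.4      # target + draft at 128K context (from the post)
TOTAL_256K_GB = 25.5      # target + draft at 256K context (from the post)
VRAM_GB = 32.0            # RTX 5090 in the test rig

def draft_kv_total(per_slot_gb: float, parallel_slots: int) -> float:
    # Simplified model: each parallel slot gets its own draft KV
    # allocation, so slots multiply the draft's memory cost.
    return per_slot_gb * parallel_slots

base_target_gb = TOTAL_128K_GB - DRAFT_OVERHEAD_GB  # ~21.1 GB for the target alone
headroom_256k = VRAM_GB - TOTAL_256K_GB             # ~6.5 GB spare at 256K
```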
What operators can take from it
The broader lesson is that speculative decoding gains depend on workload shape and tokenizer compatibility, not just on attaching any smaller draft model. In the reported sweep, draft-max 8 produced the best mixed-workload average, while 16 improved math further but gave back performance on creative tasks. For local LLM operators, that makes this thread useful as a deployment note: verify vocab compatibility first, then tune draft length against the specific kinds of outputs you care about.
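The draft-length tradeoff can be reasoned about with the standard speculative-decoding expectation: with per-token acceptance probability p and draft length k, the expected number of tokens produced per verification step is (1 - p^(k+1)) / (1 - p). This formula comes from the speculative-sampling literature, not from the post; the 0.422 acceptance rate is the post's translation/analysis figure, and the 0.9 rate is an illustrative stand-in for a high-acceptance workload.

```python
# Expected tokens emitted per verification step for draft length k and
# per-token acceptance probability p (standard speculative-decoding result).
def expected_tokens(p: float, k: int) -> float:
    return (1.0 - p ** (k + 1)) / (1.0 - p)

low = expected_tokens(0.422, 8)        # low-acceptance workload (translation)
low_long = expected_tokens(0.422, 16)  # doubling draft length barely helps here
high = expected_tokens(0.9, 8)         # illustrative high-acceptance workload
```

At 42.2% acceptance, going from a draft length of 8 to 16 changes the expectation by less than a thousandth of a token, which is consistent with the post's finding that longer drafts only pay off on high-acceptance workloads like math.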