LocalLLaMA Benchmark Claims Gemma 4 Speculative Decoding Gains of 29% on Average
Original: Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)
What the LocalLLaMA post measured
An r/LocalLLaMA post shared a controlled speculative decoding benchmark for Gemma 4 31B using Gemma 4 E2B as the draft model. The thread snapshot used for this crawl showed 257 upvotes and 88 comments. The test environment was specific: an RTX 5090 with 32GB VRAM, Windows 11, a llama.cpp fork with TurboQuant KV cache, 128K context, Flash Attention, and --draft-max 8 --draft-min 1.
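For readers wanting to reproduce a comparable setup, the post's settings map onto standard llama.cpp server flags. The sketch below is an assumption-laden reconstruction: the model filenames are placeholders, and the fork's TurboQuant KV-cache option is omitted because the post does not name its flag.

```python
# Sketch of a llama-server invocation matching the post's settings.
# Filenames are placeholders; fork-specific options are omitted.
def build_command(target_gguf: str, draft_gguf: str) -> list[str]:
    return [
        "llama-server",
        "-m", target_gguf,            # Gemma 4 31B target model
        "--model-draft", draft_gguf,  # Gemma 4 E2B draft model
        "--draft-max", "8",           # longest draft per verification step
        "--draft-min", "1",           # shortest draft worth verifying
        "-c", str(128 * 1024),        # 128K context
        "--flash-attn",               # Flash Attention
        "--parallel", "1",            # single slot (the post's recommendation)
    ]

cmd = build_command("gemma-4-31b-q4_k_m.gguf", "gemma-4-e2b-q4_k_m.gguf")
```

The quantization suffixes in the filenames are illustrative, not from the post.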
Observed throughput gains
The poster reported a baseline average of 57.17 t/s and a speculative decoding average of 73.73 t/s, which is a +29.0% improvement. The biggest gains were on structured outputs: math rose from 57.45 to 85.86 t/s, and code generation from 57.15 to 86.05 t/s, both about +50%. Semi-structured explanations gained about +24%, while translation and analysis still improved +10.7% despite a lower 42.2% acceptance rate.
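The reported percentages check out against the raw throughput numbers. A quick recomputation:

```python
# Recompute the reported percentage gains from the raw t/s figures.
def pct_gain(baseline: float, speculative: float) -> float:
    return (speculative / baseline - 1.0) * 100.0

overall = pct_gain(57.17, 73.73)    # ~+29.0% average
math_gain = pct_gain(57.45, 85.86)  # ~+49.5% on math
code_gain = pct_gain(57.15, 86.05)  # ~+50.6% on code generation
```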
The compatibility trap
The most operationally useful detail was not the headline benchmark but the failure case. Before re-downloading the model, the author saw the draft path fall to 7.31 t/s, with llama.cpp warning that the target and draft vocabularies were not compatible. The post attributes the slowdown to a metadata mismatch in add_bos_token: an early April Gemma 4 31B GGUF had it set to false, while the later E2B draft had it set to true, which forced token translation and erased the expected speedup.
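That kind of mismatch can be caught before launch by diffing tokenizer metadata between the two GGUFs. The sketch below is an illustrative stand-in for llama.cpp's own compatibility warning, not its actual logic; the key names follow standard GGUF metadata conventions.

```python
# Illustrative pre-flight check: compare tokenizer metadata between the
# target and draft GGUFs before enabling speculative decoding.
KEYS = ("tokenizer.ggml.model", "tokenizer.ggml.add_bos_token")

def vocab_mismatches(target_meta: dict, draft_meta: dict) -> list[str]:
    """Return the metadata keys on which the two models disagree."""
    return [k for k in KEYS if target_meta.get(k) != draft_meta.get(k)]

# The failure case from the post: add_bos_token false vs true.
target = {"tokenizer.ggml.model": "llama", "tokenizer.ggml.add_bos_token": False}
draft = {"tokenizer.ggml.model": "llama", "tokenizer.ggml.add_bos_token": True}
mismatches = vocab_mismatches(target, draft)
```

In practice the real metadata can be dumped with llama.cpp's own GGUF tooling; the dicts above just model the mismatch described in the post.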
Configuration notes that matter in practice
The post argues that --parallel 1 is mandatory in this setup because automatic parallelism multiplies the draft KV allocation and can consume enough VRAM to destroy performance. It also claims a Q4 draft is sufficient, that speculative decoding cannot be combined with multimodal vision in this stack, and that the additional VRAM cost is about 2.3GB, bringing the total to roughly 23.4GB at 128K context and 25.5GB at 256K.
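The post's figures support a simple budget check. The arithmetic below uses only the numbers reported in the thread (2.3GB draft overhead, 23.4GB at 128K, 25.5GB at 256K, a 32GB card); the per-slot multiplication is a simplified model of why the post treats --parallel 1 as mandatory, not a measurement.

```python
# Back-of-envelope VRAM check using the post's reported figures.
DRAFT_OVERHEAD_GB = 2.3   # extra cost of the E2B draft (from the post)
TOTAL_128K_GB = 23.4      # target + draft at 128K context (from the post)
TOTAL_256K_GB = 25.5      # target + draft at 256K context (from the post)
VRAM_GB = 32.0            # RTX 5090 in the test rig

def draft_kv_total(per_slot_gb: float, parallel_slots: int) -> float:
    # Simplified model: each parallel slot gets its own draft KV
    # allocation, so slots multiply the draft's memory cost.
    return per_slot_gb * parallel_slots

base_target_gb = TOTAL_128K_GB - DRAFT_OVERHEAD_GB  # ~21.1 GB for the target alone
headroom_256k = VRAM_GB - TOTAL_256K_GB             # ~6.5 GB spare at 256K
```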
What operators can take from it
The broader lesson is that speculative decoding gains depend on workload shape and tokenizer compatibility, not just on attaching any smaller draft model. In the reported sweep, draft-max 8 produced the best mixed-workload average, while 16 improved math further but gave back performance on creative tasks. For local LLM operators, that makes this thread useful as a deployment note: verify vocab compatibility first, then tune draft length against the specific kinds of outputs you care about.
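The draft-length tradeoff can be reasoned about with the standard speculative-decoding expectation: with per-token acceptance probability p and draft length k, the expected number of tokens produced per verification step is (1 - p^(k+1)) / (1 - p). This formula comes from the speculative-sampling literature, not from the post; the 0.422 acceptance rate is the post's translation/analysis figure, and the 0.9 rate is an illustrative stand-in for a high-acceptance workload.

```python
# Expected tokens emitted per verification step for draft length k and
# per-token acceptance probability p (standard speculative-decoding result).
def expected_tokens(p: float, k: int) -> float:
    return (1.0 - p ** (k + 1)) / (1.0 - p)

low = expected_tokens(0.422, 8)        # low-acceptance workload (translation)
low_long = expected_tokens(0.422, 16)  # doubling draft length barely helps here
high = expected_tokens(0.9, 8)         # illustrative high-acceptance workload
```

At 42.2% acceptance, going from a draft length of 8 to 16 changes the expectation by less than a thousandth of a token, which is consistent with the post's finding that longer drafts only pay off on high-acceptance workloads like math.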