Google has released open-weight MTP drafter models for Gemma 4 31B and 26B-A4B, enabling speculative decoding to significantly boost inference speed without affecting output quality.
#speculative-decoding
r/LocalLLaMA did not treat DFlash as just another benchmark screenshot. The post took off because it promised almost 2x mean throughput for Qwen3.6-27B on a single RTX 3090, with no retraining and enough memory engineering to keep long-context local inference practical.
LocalLLaMA paid attention to this post because it looked like real engineering cleanup rather than another inflated speed screenshot. On April 13, 2026, the author reported that Qwen3.5-9B at 2048 tokens went from a stock-MLX baseline of 30.96 tok/s to 127.07 tok/s with 89.36% acceptance, and released the full runtime as open source.
A fresh r/LocalLLaMA post benchmarks DFlash on an M5 Max with MLX 0.31.1, reporting 127.07 tok/s and a 4.13x speedup on Qwen3.5-9B. The most useful part is not the headline number but the post's clear reproduction setup and bandwidth-bound interpretation.
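As a quick sanity check on those figures, the end-to-end speedup implied by the two reported throughputs can be computed directly (plain arithmetic on the numbers quoted above, nothing taken from the post beyond them):

```python
# Speedup implied by the reported throughputs (values quoted in the summaries above).
baseline_tps = 30.96    # stock-MLX Qwen3.5-9B at 2048 tokens, tok/s
dflash_tps = 127.07     # DFlash runtime on the same setup, tok/s

print(f"{dflash_tps / baseline_tps:.2f}x")   # ~4.10x, in line with the reported 4.13x figure
```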
A detailed r/LocalLLaMA benchmark reports that pairing Gemma 4 31B with Gemma 4 E2B as a draft model in `llama.cpp` lifted average throughput from 57.17 t/s to 73.73 t/s.
A new r/LocalLLaMA benchmark reports that Gemma 4 31B paired with an E2B draft model can gain about 29% average throughput, with code generation improving by roughly 50%.
A LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.
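That bit-for-bit property is a general consequence of how greedy speculative decoding verifies drafts: the target model checks the drafted block and keeps only the prefix that matches its own greedy choices, so the emitted text is exactly what plain greedy decoding would have produced. A minimal toy sketch, with hypothetical `draft_next`/`target_next` next-token callables standing in for real models:

```python
# Toy greedy speculative decoding. draft_next / target_next are hypothetical
# next-token functions (small drafter and large target); any deterministic
# callables taking a token list and returning one token work.

def speculative_greedy(prompt, draft_next, target_next, k=4, max_new=32):
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft k candidate tokens cheaply with the small model.
        ctx = list(out)
        drafted = []
        for _ in range(k):
            tok = draft_next(ctx)
            drafted.append(tok)
            ctx.append(tok)
        # 2) Verify against the target model's own greedy choices.
        #    (A real runtime scores all k positions in one batched forward pass;
        #    calling target_next per position here is only for clarity.)
        for tok in drafted:
            target_tok = target_next(out)
            if target_tok != tok:
                out.append(target_tok)   # first mismatch: keep the target's token, drop the rest
                break
            out.append(tok)              # match: the drafted token IS the greedy token
        else:
            out.append(target_next(out)) # whole block accepted: the verify pass yields a bonus token
    return out
```

Every appended token is either the target's own greedy pick or a drafted token verified to equal it, which is why acceleration of this form cannot change the output; the speedup comes from verifying the whole drafted block in a single target pass instead of one pass per token.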
A LocalLLaMA thread drew attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
On March 31, 2026, Together Research announced Aurora, an open-source framework for adaptive speculative decoding that learns from live inference traces and updates the speculator asynchronously without interrupting serving. Together’s blog and paper say Aurora reframes the problem as asynchronous RL and can deliver 1.25x additional speedup over a strong static speculator as traffic shifts.
A Reddit thread in r/LocalLLaMA spotlighted mlx-lm PR #990, which uses Qwen3.5's built-in MTP head for native speculative decoding and reports 15.3 -> 23.3 tok/s (~1.5x throughput boost) with an ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. The gain is meaningful, but so are the constraints around converted checkpoints, disabled batching, and untested MoE variants.
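The reported acceptance rate and speedup are roughly consistent under the standard speculative-decoding analysis. A small sketch, under two assumptions that are mine rather than details from the PR: the MTP head drafts a single token per verification step, and the ~80.6% figure behaves like an i.i.d. per-token acceptance probability.

```python
# Standard speculative-decoding estimate of tokens emitted per target forward pass
# (ignoring draft/verify overhead). Assumptions, not facts from mlx-lm PR #990:
#   k = 1 token drafted per step (single MTP head), alpha = 0.806 per-token acceptance.

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens and acceptance alpha."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens_per_pass(0.806, k=1))   # ~1.81x ideal, ignoring overhead
print(23.3 / 15.3)                            # ~1.52x measured; the gap is drafting + verification cost
```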