vLLM lifts FP8 long-context accuracy from 13% to 89%
What the benchmark result says
One of the biggest promises of low-precision inference is cheaper, faster serving without wrecking model quality. The vLLM project says it found a way to recover much of the accuracy that FP8 KV-cache had been losing on long-context workloads. In the project’s X post, maintainers highlighted a deep dive showing that a two-level accumulation fix raised accuracy on a 128k needle-in-a-haystack task from 13% to 89%, while keeping the FP8 decode speedup.
“two-level accumulation in FA3 takes 128k needle-in-a-haystack from 13% → 89%, while keeping the FP8 decode speedup”
The vllm_project account is the primary release channel for one of the most widely used open-source inference runtimes, and its posts generally tie directly to deployable code paths rather than marketing claims. The linked technical post, based on work from AWS and Red Hat AI, explains the underlying issue clearly: on the same 128k task, the BF16 baseline scored 91%, while FP8 attention had collapsed to 13% because of imprecise accumulation inside the attention path. The new two-level accumulation approach brought the result back to 89%, close enough to make FP8 look viable again for some long-context deployments.
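To make the failure mode concrete, here is a minimal NumPy sketch of why a long reduction held in a low-precision accumulator drifts, and how a two-level scheme recovers it. It is not the FlashAttention-3 kernel: float16 merely stands in for the limited-precision accumulator used by FP8 tensor-core math on the GPU, and the data, block size, and reduction are illustrative assumptions.

```python
# Illustrative only: float16 stands in for the limited-precision accumulator
# used by FP8 tensor-core math; the real fix lives inside the FA3 GPU kernel.
import numpy as np

rng = np.random.default_rng(0)
# ~128k small positive terms, loosely analogous to reducing over a 128k context.
x = rng.uniform(0.0, 0.02, size=131_072).astype(np.float32)

# Reference: accumulate everything in float32.
ref = float(np.sum(x, dtype=np.float32))

# Single-level: every addition is rounded back to the low-precision accumulator,
# so once the running sum is large, small contributions get rounded away entirely.
acc_single = np.float16(0.0)
for v in x:
    acc_single = np.float16(acc_single + np.float16(v))

# Two-level: sum short blocks in the low-precision accumulator, then fold each
# block's partial sum into a float32 accumulator, so rounding error is bounded
# per block instead of compounding across the whole 128k-long reduction.
BLOCK = 128
acc_two_level = np.float32(0.0)
for start in range(0, x.size, BLOCK):
    block_acc = np.float16(0.0)
    for v in x[start:start + BLOCK]:
        block_acc = np.float16(block_acc + np.float16(v))
    acc_two_level += np.float32(block_acc)

print(f"float32 reference      : {ref:10.2f}")
print(f"single-level float16   : {float(acc_single):10.2f}")
print(f"two-level accumulation : {float(acc_two_level):10.2f}")
```

In this toy version the single-level sum stalls far below the float32 reference while the two-level sum lands close to it, which mirrors, in miniature, the qualitative gap the 13% versus 89% numbers describe.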
Why this matters beyond one benchmark
The post also mentions a new --kv-cache-dtype-skip-layers flag for hybrid-attention models such as gpt-oss. That matters because production inference rarely looks like a single, tidy lab setting: operators need knobs that keep the speed benefits of quantization while routing around layers that are unusually sensitive to it. In other words, the story here is not just a prettier chart; it is that vLLM is turning a known FP8 quality failure into something practitioners may be able to manage with explicit engineering controls.
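For operators, that knob is exercised at serve time. Below is a hypothetical sketch of what it might look like: --kv-cache-dtype fp8 is an existing vLLM option and --kv-cache-dtype-skip-layers is the flag the post names, but the placeholder values are assumptions, since the post does not spell out whether the flag takes layer indices or names; check vllm serve --help on a build that includes the change.

```bash
# Hypothetical sketch, not a verified command line. The placeholders mark values
# this article does not specify; consult vllm serve --help for the exact syntax.
vllm serve <hybrid-attention-model> \
  --kv-cache-dtype fp8 \
  --kv-cache-dtype-skip-layers <layers-to-keep-at-higher-precision>
```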
What to watch next is reproducibility across more model families, especially hybrid-attention and MoE systems, and whether the recovered accuracy holds outside needle-in-a-haystack style evaluations. If it does, FP8 KV-cache becomes less of a risky, expert-only optimization and more of a mainstream deployment option for long-context inference.

Source: vLLM source tweet · vLLM FP8 deep dive
Related Articles
Cohere said its W4A8 path in vLLM is up to 58% faster on TTFT and 45% faster on TPOT than W4A16 on Hopper, underscoring that inference cost is now a product constraint, not only an infrastructure problem.
A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
A popular r/LocalLLaMA self-post lays out a concrete 2x H200 serving stack for GPT-OSS-120B, including routing, monitoring, and queueing tradeoffs. The appeal is not just the headline throughput, but the unusually detailed operational data behind it.