vLLM lifts FP8 long-context accuracy from 13% to 89%
What the benchmark result says
One of the biggest promises of low-precision inference is cheaper, faster serving without wrecking model quality. The vLLM project says it found a way to recover much of the accuracy that FP8 KV-cache had been losing on long-context workloads. In the project’s X post, maintainers highlighted a deep dive showing that a two-level accumulation fix raised accuracy on a 128k needle-in-a-haystack task from 13% to 89%, while keeping the FP8 decode speedup.
“two-level accumulation in FA3 takes 128k needle-in-a-haystack from 13% → 89%, while keeping the FP8 decode speedup”
The vllm_project account is the primary release channel for one of the most widely used open-source inference runtimes, and its posts generally tie directly to deployable code paths rather than marketing claims. The linked technical post, based on work from AWS and Red Hat AI, explains the underlying issue clearly: on the same 128k task, the BF16 baseline scored 91%, while FP8 attention had collapsed to 13% because of imprecise accumulation inside the attention path. The new two-level accumulation approach brought the result back to 89%, close enough to make FP8 look viable again for some long-context deployments.
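To make the failure mode concrete, here is a minimal NumPy sketch of why a long reduction held in a low-precision accumulator drifts, and how a two-level scheme recovers it. It is not the FlashAttention-3 kernel: float16 merely stands in for the limited-precision accumulator used by FP8 tensor-core math on the GPU, and the data, block size, and reduction are illustrative assumptions.

```python
# Illustrative only: float16 stands in for the limited-precision accumulator
# used by FP8 tensor-core math; the real fix lives inside the FA3 GPU kernel.
import numpy as np

rng = np.random.default_rng(0)
# ~128k small positive terms, loosely analogous to reducing over a 128k context.
x = rng.uniform(0.0, 0.02, size=131_072).astype(np.float32)

# Reference: accumulate everything in float32.
ref = float(np.sum(x, dtype=np.float32))

# Single-level: every addition is rounded back to the low-precision accumulator,
# so once the running sum is large, small contributions get rounded away entirely.
acc_single = np.float16(0.0)
for v in x:
    acc_single = np.float16(acc_single + np.float16(v))

# Two-level: sum short blocks in the low-precision accumulator, then fold each
# block's partial sum into a float32 accumulator, so rounding error is bounded
# per block instead of compounding across the whole 128k-long reduction.
BLOCK = 128
acc_two_level = np.float32(0.0)
for start in range(0, x.size, BLOCK):
    block_acc = np.float16(0.0)
    for v in x[start:start + BLOCK]:
        block_acc = np.float16(block_acc + np.float16(v))
    acc_two_level += np.float32(block_acc)

print(f"float32 reference      : {ref:10.2f}")
print(f"single-level float16   : {float(acc_single):10.2f}")
print(f"two-level accumulation : {float(acc_two_level):10.2f}")
```

In this toy version the single-level sum stalls far below the float32 reference while the two-level sum lands close to it, which mirrors, in miniature, the qualitative gap the 13% versus 89% numbers describe.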
Why this matters beyond one benchmark
The post also mentions a new --kv-cache-dtype-skip-layers flag for hybrid-attention models such as gpt-oss. That matters because production inference rarely looks like a single, tidy lab setting: operators need knobs that keep the speed benefits of quantization while routing around layers that are unusually sensitive to it. In other words, the story here is not just a prettier chart; it is that vLLM is turning a known FP8 quality failure into something practitioners may be able to manage with explicit engineering controls.
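For operators, that knob is exercised at serve time. Below is a hypothetical sketch of what it might look like: --kv-cache-dtype fp8 is an existing vLLM option and --kv-cache-dtype-skip-layers is the flag the post names, but the placeholder values are assumptions, since the post does not spell out whether the flag takes layer indices or names; check vllm serve --help on a build that includes the change.

```bash
# Hypothetical sketch, not a verified command line. The placeholders mark values
# this article does not specify; consult vllm serve --help for the exact syntax.
vllm serve <hybrid-attention-model> \
  --kv-cache-dtype fp8 \
  --kv-cache-dtype-skip-layers <layers-to-keep-at-higher-precision>
```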
What to watch next is reproducibility across more model families, especially hybrid-attention and MoE systems, and whether the recovered accuracy holds outside needle-in-a-haystack style evaluations. If it does, FP8 KV-cache becomes less of a risky, expert-only optimization and more of a mainstream deployment option for long-context inference.

Source: vLLM source tweet · vLLM FP8 deep dive
Related Articles
Cohere said its W4A8 path in vLLM is up to 58% faster on TTFT and 45% faster on TPOT than W4A16 on Hopper, underscoring that inference cost is now a product constraint, not only an infrastructure problem.
A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
A popular r/LocalLLaMA self-post lays out a concrete 2x H200 serving stack for GPT-OSS-120B, including routing, monitoring, and queueing tradeoffs. The appeal is not just the headline throughput, but the unusually detailed operational data behind it.