vLLM lifts FP8 long-context accuracy from 13% to 89%

LLM · Apr 28, 2026 · By Insights AI · 2 min read

What the benchmark result says

One of the biggest promises of low-precision inference is cheaper, faster serving without wrecking model quality. The vLLM project says it found a way to recover much of the accuracy that FP8 KV-cache had been losing on long-context workloads. In the project’s X post, maintainers highlighted a deep dive showing that a two-level accumulation fix raised performance on a 128k needle-in-a-haystack task from 13% to 89%, while keeping the FP8 decode speedup.

“two-level accumulation in FA3 takes 128k needle-in-a-haystack from 13% → 89%, while keeping the FP8 decode speedup”

The vllm_project account is the primary release channel for one of the most widely used open-source inference runtimes, and its posts usually tie directly to deployable code paths rather than marketing claims. The linked technical post, based on work from AWS and Red Hat AI, explains the underlying issue clearly. On the same 128k task, the BF16 baseline scored 91%, while FP8 attention had collapsed to 13% because of imprecise accumulation inside the attention path. The new two-level accumulation approach brought the result back to 89%, close enough to make FP8 look viable again for some long-context deployments.
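The precision failure described above is easy to reproduce in miniature. The sketch below is illustrative only, not vLLM's kernel: it shows why summing many attention terms in a single low-precision accumulator destroys accuracy at long context, and how a two-level scheme (short low-precision partial sums flushed into a higher-precision accumulator, the shape of the FA3 fix the post describes) recovers most of it. IEEE half precision stands in for FP8, since pure Python has no FP8 type; the block size and value ranges are arbitrary choices for the demo.

```python
import random
import struct

def to_half(x: float) -> float:
    """Round a float to IEEE binary16 via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def naive_sum(values):
    """One low-precision accumulator: once it grows, small terms
    fall below half an ulp and stop contributing at all."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + to_half(v))
    return acc

def two_level_sum(values, block=32):
    """Two-level accumulation: low-precision sums over short blocks,
    combined in full precision so small blocks are never swamped."""
    outer = 0.0  # higher-precision (float64) accumulator
    for start in range(0, len(values), block):
        inner = 0.0
        for v in values[start:start + block]:
            inner = to_half(inner + to_half(v))
        outer += inner  # flush the partial sum at full precision
    return outer

# A 128k context means ~131k terms contribute to each attention output.
random.seed(0)
values = [random.uniform(0.0, 0.01) for _ in range(131072)]

exact = sum(values)
naive_err = abs(naive_sum(values) - exact)
two_level_err = abs(two_level_sum(values) - exact)
print(f"naive error: {naive_err:.2f}, two-level error: {two_level_err:.4f}")
```

The naive accumulator stalls once its magnitude makes each new term smaller than half an ulp, so its error grows with context length; the two-level version keeps each low-precision partial sum small enough to stay accurate.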

Why this matters beyond one benchmark

The post also mentions a new --kv-cache-dtype-skip-layers flag for hybrid-attention models such as gpt-oss. That matters because production inference rarely matches one neat lab setting: operators need knobs that keep the speed benefits of quantization while routing around layers that are unusually sensitive to it. In other words, the story here is not just a prettier chart. It is vLLM turning a known FP8 quality failure into something practitioners may be able to manage with explicit engineering controls.
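A CLI sketch of what that control could look like: --kv-cache-dtype fp8 is an existing vLLM serve option, but the value syntax for the new skip-layers flag shown here (a comma-separated list of layer indices) is an assumption based on the post, and the released syntax may differ.

```shell
# Hedged sketch, not verified against released docs:
# --kv-cache-dtype fp8 is an existing vLLM flag; the
# "0,1" layer-index value for the new flag is hypothetical.
vllm serve openai/gpt-oss-20b \
  --kv-cache-dtype fp8 \
  --kv-cache-dtype-skip-layers 0,1
```

The point of a per-layer escape hatch is that a handful of quantization-sensitive layers can stay in full precision while the rest of the KV cache keeps the FP8 memory and decode-speed wins.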

What to watch next is reproducibility across more model families, especially hybrid-attention and MoE systems, and whether the recovered accuracy holds outside needle-in-a-haystack-style evaluations. If it does, FP8 KV-cache becomes less of a risky expert-only optimization and more of a mainstream deployment option for long-context inference.

Source: vLLM source tweet · vLLM FP8 deep dive


© 2026 Insights. All rights reserved.