PFlash Achieves 10x Prefill Speedup Over llama.cpp at 128K Context on RTX 3090
The Prefill Bottleneck
For long-context LLM inference, prefill is a critical bottleneck. While Q4_K_M Qwen3.6-27B decodes at ~74 tokens/sec on an RTX 3090, prefill cost grows as O(S²) in prompt length S, so time-to-first-token balloons with context: on a 131K-token prompt, vanilla llama.cpp takes over 248 seconds.
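To make the quadratic term concrete, here is a back-of-the-envelope cost model for the attention portion of prefill; the layer count, hidden size, and FLOP formula are illustrative assumptions, not Qwen3.6-27B's actual configuration:

```cpp
// Back-of-the-envelope prefill cost model. Layer count, hidden size, and
// the FLOP formula are illustrative assumptions, not the real Qwen3.6-27B
// configuration.
#include <cstdio>

int main() {
    const double layers  = 48;    // assumed transformer layer count
    const double d_model = 5120;  // assumed hidden size
    const double seqs[]  = {65536, 131072};

    for (double S : seqs) {
        // Attention score + value mixing: ~4 * L * S^2 * d FLOPs total,
        // quadratic in S. The FFN cost (~linear in S) is omitted here to
        // isolate the quadratic term.
        double attn_flops = 4.0 * layers * S * S * d_model;
        printf("S = %6.0fK: ~%.1f PFLOPs of attention during prefill\n",
               S / 1024.0, attn_flops / 1e15);
    }
    return 0;
}
```

Doubling the context from 64K to 128K quadruples the attention work, which is why skipping spans of the prompt pays off most at long contexts.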
How PFlash Works
PFlash introduces speculative prefill: a small drafter model loaded in-process scores token importance across the full prompt, and the heavy target model prefills only the spans that matter. The entire inference loop is implemented in C++/CUDA: no Python, no Triton, no PyTorch.
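As a rough illustration of the idea, the sketch below shows one plausible shape for the span-selection step; `select_spans`, the threshold/gap heuristic, and the commented-out driver calls are hypothetical and not PFlash's actual API:

```cpp
// Minimal sketch of span selection for speculative prefill. The drafter
// scoring call, the span heuristic, and all names here are hypothetical
// illustrations; PFlash's actual C++/CUDA implementation may differ.
#include <cstddef>
#include <vector>

struct Span { std::size_t begin, end; };  // half-open token range [begin, end)

// Given per-token importance scores from the small drafter model, keep
// contiguous runs of tokens whose score clears a threshold, merging runs
// separated by fewer than `gap` tokens so the target model prefills
// coherent spans rather than isolated tokens.
std::vector<Span> select_spans(const std::vector<float>& scores,
                               float threshold, std::size_t gap) {
    std::vector<Span> spans;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (scores[i] < threshold) continue;
        if (!spans.empty() && i - spans.back().end < gap) {
            spans.back().end = i + 1;     // extend the previous span
        } else {
            spans.push_back({i, i + 1});  // start a new span
        }
    }
    return spans;
}

// Driver outline: the drafter scores the full prompt cheaply, then the
// heavy target model prefills only the selected spans into its KV cache.
// `drafter_score_tokens` and `target_prefill_span` are placeholders for
// the real model calls, so this stays commented out:
//
// void speculative_prefill(const std::vector<int>& prompt) {
//     std::vector<float> scores = drafter_score_tokens(prompt);
//     for (const Span& s : select_spans(scores, /*threshold=*/0.5f, /*gap=*/32))
//         target_prefill_span(prompt, s.begin, s.end);
// }
```

The merging step matters because attention over a handful of scattered tokens loses local context; prefilling short contiguous windows is a common way such schemes keep retrieval accuracy intact.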
Benchmark Results
- 128K context: 24.8s TTFT vs 257s (llama.cpp) = 10.4x speedup
- 64K context: 13.5s TTFT vs 134.95s (llama.cpp) = 10.0x speedup
NIAH (Needle In A Haystack) retrieval accuracy is preserved end-to-end.
Open Source
Available at github.com/Luce-Org/lucebox-hub under an MIT license. The LocalLLaMA community has already been combining PFlash with DFlash speculative decoding for additional gains on consumer hardware.