PFlash Achieves 10x Prefill Speedup Over llama.cpp at 128K Context on RTX 3090
The Prefill Bottleneck
For long-context LLM inference, prefill is a critical bottleneck. While Q4_K_M Qwen3.6-27B decodes at ~74 tokens/sec on an RTX 3090, prefill cost grows as O(S²) in prompt length S, so time-to-first-token balloons with context: on a 131K-token prompt, vanilla llama.cpp takes over 248 seconds.
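To make the quadratic term concrete, here is a back-of-the-envelope cost model for the attention portion of prefill; the layer count, hidden size, and FLOP formula are illustrative assumptions, not Qwen3.6-27B's actual configuration:

```cpp
// Back-of-the-envelope prefill cost model. Layer count, hidden size, and
// the FLOP formula are illustrative assumptions, not the real Qwen3.6-27B
// configuration.
#include <cstdio>

int main() {
    const double layers  = 48;    // assumed transformer layer count
    const double d_model = 5120;  // assumed hidden size
    const double seqs[]  = {65536, 131072};

    for (double S : seqs) {
        // Attention score + value mixing: ~4 * L * S^2 * d FLOPs total,
        // quadratic in S. The FFN cost (~linear in S) is omitted here to
        // isolate the quadratic term.
        double attn_flops = 4.0 * layers * S * S * d_model;
        printf("S = %6.0fK: ~%.1f PFLOPs of attention during prefill\n",
               S / 1024.0, attn_flops / 1e15);
    }
    return 0;
}
```

Doubling the context from 64K to 128K quadruples the attention work, which is why skipping spans of the prompt pays off most at long contexts.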
How PFlash Works
PFlash introduces speculative prefill: a small drafter model loaded in-process scores token importance across the full prompt, and the heavy target model prefills only the spans that matter. The entire inference loop is implemented in C++/CUDA: no Python, no Triton, no PyTorch.
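As a rough illustration of the idea, the sketch below shows one plausible shape for the span-selection step; `select_spans`, the threshold/gap heuristic, and the commented-out driver calls are hypothetical and not PFlash's actual API:

```cpp
// Minimal sketch of span selection for speculative prefill. The drafter
// scoring call, the span heuristic, and all names here are hypothetical
// illustrations; PFlash's actual C++/CUDA implementation may differ.
#include <cstddef>
#include <vector>

struct Span { std::size_t begin, end; };  // half-open token range [begin, end)

// Given per-token importance scores from the small drafter model, keep
// contiguous runs of tokens whose score clears a threshold, merging runs
// separated by fewer than `gap` tokens so the target model prefills
// coherent spans rather than isolated tokens.
std::vector<Span> select_spans(const std::vector<float>& scores,
                               float threshold, std::size_t gap) {
    std::vector<Span> spans;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (scores[i] < threshold) continue;
        if (!spans.empty() && i - spans.back().end < gap) {
            spans.back().end = i + 1;     // extend the previous span
        } else {
            spans.push_back({i, i + 1});  // start a new span
        }
    }
    return spans;
}

// Driver outline: the drafter scores the full prompt cheaply, then the
// heavy target model prefills only the selected spans into its KV cache.
// `drafter_score_tokens` and `target_prefill_span` are placeholders for
// the real model calls, so this stays commented out:
//
// void speculative_prefill(const std::vector<int>& prompt) {
//     std::vector<float> scores = drafter_score_tokens(prompt);
//     for (const Span& s : select_spans(scores, /*threshold=*/0.5f, /*gap=*/32))
//         target_prefill_span(prompt, s.begin, s.end);
// }
```

The merging step matters because attention over a handful of scattered tokens loses local context; prefilling short contiguous windows is a common way such schemes keep retrieval accuracy intact.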
Benchmark Results
- 128K context: 24.8s TTFT vs 257s (llama.cpp) = 10.4x speedup
- 64K context: 13.5s TTFT vs 134.95s (llama.cpp) = 10.0x speedup
NIAH (Needle In A Haystack) retrieval accuracy is preserved end-to-end.
Open Source
Available at github.com/Luce-Org/lucebox-hub under an MIT license. The LocalLLaMA community has already been combining PFlash with DFlash speculative decoding for additional gains on consumer hardware.