PFlash Achieves 10x Prefill Speedup Over llama.cpp at 128K Context on RTX 3090

Original: PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090

LLM · May 2, 2026 · By Insights AI (Reddit)

The Prefill Bottleneck

For long-context LLM inference, prefill is a critical bottleneck. While Q4_K_M Qwen3.6-27B decodes at ~74 tokens/sec on an RTX 3090, prefill cost grows as O(S²) in the prompt length S. On a 131K-token prompt, vanilla llama.cpp takes over 248 seconds.
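As a rough illustration (the decomposition and coefficients are assumptions for exposition, not measurements from the post), prefill latency on a prompt of length S can be modeled as TTFT(S) ≈ a·S + b·S², where the linear term covers per-token FFN and weight-streaming work and the quadratic term covers attention over all earlier tokens. As S approaches 128K, total prefill time balloons even though per-token decode speed stays roughly constant, which is why time-to-first-token, not decode, dominates long-context latency.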

How PFlash Works

PFlash introduces speculative prefill: a small drafter model loaded in-process scores token importance across the full prompt, and the heavy target model only prefills the spans that matter. The entire inference loop is implemented in C++/CUDA — no Python, no Triton, no PyTorch.
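The post does not include code, so the following is a minimal C++ sketch of the span-selection idea as described, under stated assumptions: the names draft_importance, select_spans, and target_prefill, the thresholding scheme, and the dummy scoring are hypothetical placeholders, not PFlash's actual API or kernels.

```cpp
// Minimal sketch of speculative prefill: a cheap drafter scores every prompt
// token, contiguous high-scoring tokens are merged into spans, and only those
// spans are handed to the heavy target model's prefill pass.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Span { std::size_t begin, end; };  // half-open token range [begin, end)

// Stand-in drafter: assigns an importance score to every prompt token.
// In the real system this would be a small in-process model doing a cheap pass.
std::vector<float> draft_importance(const std::vector<int>& prompt) {
    std::vector<float> scores(prompt.size());
    for (std::size_t i = 0; i < prompt.size(); ++i)
        scores[i] = static_cast<float>(prompt[i] % 100) / 100.0f;  // dummy scores
    return scores;
}

// Merge consecutive high-scoring tokens into spans so the target model
// only prefills the regions the drafter considers important.
std::vector<Span> select_spans(const std::vector<float>& scores, float threshold) {
    std::vector<Span> spans;
    std::size_t i = 0;
    while (i < scores.size()) {
        if (scores[i] >= threshold) {
            std::size_t begin = i;
            while (i < scores.size() && scores[i] >= threshold) ++i;
            spans.push_back({begin, i});
        } else {
            ++i;
        }
    }
    return spans;
}

// Stand-in for the target model's prefill: it touches only the selected spans
// instead of all S tokens, cutting the quadratic attention work.
void target_prefill(const std::vector<int>& prompt, const std::vector<Span>& spans) {
    std::size_t kept = 0;
    for (const Span& s : spans) kept += s.end - s.begin;
    std::printf("prefilling %zu of %zu tokens across %zu spans\n",
                kept, prompt.size(), spans.size());
}

int main() {
    std::vector<int> prompt(131072);  // pretend 128K-token prompt
    for (std::size_t i = 0; i < prompt.size(); ++i) prompt[i] = static_cast<int>(i);

    std::vector<float> scores = draft_importance(prompt);  // cheap drafter pass
    std::vector<Span> spans = select_spans(scores, 0.8f);  // keep what matters
    target_prefill(prompt, spans);                         // pruned heavy pass
    return 0;
}
```

The key design point, as described in the post, is that the expensive target-model pass now scales with the number of kept tokens rather than the full prompt length.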

Benchmark Results

  • 128K context: 24.8s TTFT vs 257s (llama.cpp) = 10.4x speedup
  • 64K context: 13.5s vs 134.95s = 10.0x speedup

NIAH (Needle In A Haystack) retrieval accuracy is preserved end-to-end.

Open Source

Available at github.com/Luce-Org/lucebox-hub under an MIT license. The LocalLLaMA community has already been combining PFlash with DFlash speculative decoding for additional gains on consumer hardware.
