Cloudflare cuts Kimi K2.5 token latency to 20-30 ms
Original: Building the foundation for running extra-large language models
Cloudflare’s latest LLM infrastructure post asks a more useful question than “which model is available?” It asks what has to change in the serving stack when trillion-parameter models become the engine for agent workloads. In a technical post dated April 16, 2026, Cloudflare says it has made Kimi K2.5 on Workers AI 3x faster.
The headline metric is latency. Cloudflare says that after moving traffic to a prefill-decode disaggregated architecture, p90 Time to First Token improved even as request volume increased and the GPU count stayed the same. The sharper number is p90 time per token: it moved from roughly 100 ms with high variance to 20-30 ms. For interactive agents and coding assistants, that kind of inter-token latency change is not abstract infrastructure trivia; it changes whether the product feels responsive.
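To put the per-token numbers in user-facing terms, here is a minimal back-of-the-envelope sketch. It assumes, for illustration only, that the p90 time per token can be treated as a steady decode rate and that a response is 500 output tokens long; neither assumption comes from Cloudflare's post, and the function names are hypothetical.

```python
def decode_seconds(output_tokens: int, ms_per_token: float) -> float:
    """Time spent generating output, ignoring Time to First Token."""
    return output_tokens * ms_per_token / 1000.0


def tokens_per_second(ms_per_token: float) -> float:
    """Decode rate implied by a fixed per-token latency."""
    return 1000.0 / ms_per_token


if __name__ == "__main__":
    response_tokens = 500  # assumed response length, not from the post
    for ms in (100, 30, 20):
        print(
            f"{ms} ms/token -> {tokens_per_second(ms):.1f} tok/s, "
            f"{decode_seconds(response_tokens, ms):.1f} s for {response_tokens} tokens"
        )
```

Under those assumptions, a 500-token answer drops from about 50 seconds of generation to somewhere between 10 and 15 seconds.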
The architecture is built around the shape of agent traffic. Agents accumulate system prompts, tools, MCPs, previous messages and generated code. Each turn can send a large amount of input context before the model generates the next output. Cloudflare says Workers AI therefore focused on fast input token processing and fast tool calling. Prefill and decode run on separate server pools, and token-aware load balancing estimates in-flight prefill and decode tokens across endpoints to spread work more evenly.
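The idea of token-aware balancing across separate prefill and decode pools can be sketched roughly as follows. This is not Cloudflare's implementation: the `Endpoint` class, the `route` and `release` helpers, and the "pick the endpoint with the fewest in-flight tokens" policy are all assumptions chosen to illustrate the concept.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    inflight_tokens: int = 0  # estimated tokens currently being processed


def least_loaded(pool: list[Endpoint]) -> Endpoint:
    """Pick the endpoint with the fewest estimated in-flight tokens."""
    return min(pool, key=lambda e: e.inflight_tokens)


def route(prefill_pool, decode_pool, prompt_tokens, expected_output_tokens):
    """Assign one request to a prefill endpoint and a decode endpoint.

    Load is counted in tokens rather than requests, so one long-context
    agent turn weighs more than many short prompts.
    """
    prefill = least_loaded(prefill_pool)
    decode = least_loaded(decode_pool)
    prefill.inflight_tokens += prompt_tokens
    decode.inflight_tokens += expected_output_tokens
    return prefill, decode


def release(endpoint: Endpoint, tokens: int) -> None:
    """Return tokens to the estimate once a phase finishes."""
    endpoint.inflight_tokens = max(0, endpoint.inflight_tokens - tokens)


if __name__ == "__main__":
    prefill_pool = [Endpoint("prefill-0", 50_000), Endpoint("prefill-1", 1_000)]
    decode_pool = [Endpoint("decode-0", 200), Endpoint("decode-1", 3_000)]
    p, d = route(prefill_pool, decode_pool, prompt_tokens=30_000,
                 expected_output_tokens=500)
    print(p.name, d.name)
```

Counting tokens instead of requests is the design point: a request-count balancer would treat a 30,000-token agent turn and a 50-token chat message as equal work.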
Prompt caching is the second lever. Cloudflare uses an x-session-affinity header to route requests toward a region that already has computed input tensors for a session. After working with heavy internal users, the company says input-token cache hit ratios rose from 60% to 80% during peak times. That matters because a small miss rate in long-context agent sessions can turn into a material number of extra GPUs.
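A quick calculation shows why that shift matters. The sketch below assumes, as a simplification, that prefill work scales linearly with the number of uncached input tokens; the one-million-token workload and the function name are illustrative, not figures from the post.

```python
def uncached_input_tokens(total_input_tokens: int, hit_percent: int) -> int:
    """Input tokens that miss the prompt cache and must be prefilled again."""
    if not 0 <= hit_percent <= 100:
        raise ValueError("hit_percent must be between 0 and 100")
    return total_input_tokens * (100 - hit_percent) // 100


if __name__ == "__main__":
    total = 1_000_000  # assumed input tokens in some peak window
    before = uncached_input_tokens(total, 60)
    after = uncached_input_tokens(total, 80)
    print(before, after, f"{100 * (before - after) // before}% less prefill work")
```

Moving from a 60% to an 80% hit ratio halves the miss rate from 40% to 20%, so under this linear assumption the uncached prefill load is cut in half.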
The post also details Infire, Cloudflare’s Rust inference engine. Kimi K2.5 is described as over 1 trillion parameters with about 560 GB of model weights, requiring at least 8 H100s before extra KV-cache needs are counted. Cloudflare says Infire can run Llama 4 Scout on two H200 GPUs with more than 56 GiB left for KV-cache, enough for more than 1.2 million tokens, and can run Kimi K2.5 on 8 H100 GPUs with more than 30 GiB left for KV-cache. The company also says its largest models can begin serving requests in under 20 seconds and that Infire can deliver up to 20% higher tokens-per-second throughput on unconstrained systems.
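The Llama 4 Scout figures imply a rough KV-cache budget per token. The sketch below simply divides the two quoted numbers; because the post gives both as lower bounds ("more than"), the result is an order-of-magnitude estimate rather than a measured value, and the function name is hypothetical.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def kv_bytes_per_token(kv_cache_gib: int, cached_tokens: int) -> int:
    """Approximate KV-cache bytes available per cached token."""
    return kv_cache_gib * GIB // cached_tokens


if __name__ == "__main__":
    per_token = kv_bytes_per_token(56, 1_200_000)
    print(f"about {per_token / 1024:.0f} KiB of KV-cache per token")
```

That works out to roughly 50 KB per token, which is why spare memory after loading weights translates so directly into how much session context a deployment can keep resident.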