Cloudflare cuts Kimi K2.5 token latency to 20-30 ms

Cloudflare’s latest LLM infrastructure post asks a more useful question than “which model is available?” It asks what has to change in the serving stack when trillion-parameter models become the engine for agent workloads. In a technical post dated April 16, 2026, Cloudflare says it has made Kimi K2.5 on Workers AI 3x faster.

The headline metric is latency. Cloudflare says that after moving traffic to a disaggregated prefill/decode architecture, p90 Time to First Token improved even as request volume increased and the GPU count stayed the same. The sharper number is p90 time per token: it moved from roughly 100 ms with high variance to 20-30 ms. For interactive agents and coding assistants, that kind of inter-token latency change is not abstract infrastructure trivia; it changes whether the product feels responsive.
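To put the per-token figures in user-facing terms, the arithmetic can be sketched as below. The function names and the 500-token reply length are illustrative assumptions, not values from Cloudflare's post.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds to decode speed in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


def decode_seconds(output_tokens: int, ms_per_token: float) -> float:
    """Time to stream output_tokens once the first token has arrived."""
    return output_tokens * ms_per_token / 1000.0


if __name__ == "__main__":
    # 500 output tokens is a hypothetical reply length, not a figure from the post.
    for ms in (100, 30, 20):
        print(
            f"{ms} ms/token -> {tokens_per_second(ms):.1f} tok/s, "
            f"500 tokens in {decode_seconds(500, ms):.0f} s"
        )
```

Under these assumptions, a 500-token reply that took about 50 seconds to stream at 100 ms per token takes 10 to 15 seconds at 20-30 ms per token.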

The architecture is built around the shape of agent traffic. Agents accumulate system prompts, tools, MCPs, previous messages and generated code. Each turn can send a large amount of input context before the model generates the next output. Cloudflare says Workers AI therefore focused on fast input token processing and fast tool calling. Prefill and decode run on separate server pools, and token-aware load balancing estimates in-flight prefill and decode tokens across endpoints to spread work more evenly.
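Cloudflare does not publish its balancer, but the idea of token-aware routing across separate prefill and decode pools can be sketched roughly as follows. Everything here is an assumption for illustration: the `Endpoint` type, the `route` and `finish` helpers, and the simple least-tokens policy.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    inflight_tokens: int = 0  # estimated tokens still to be processed here


def pick_least_loaded(pool: list[Endpoint]) -> Endpoint:
    """Return the endpoint with the fewest estimated in-flight tokens."""
    if not pool:
        raise ValueError("pool is empty")
    return min(pool, key=lambda e: e.inflight_tokens)


def route(
    prefill_pool: list[Endpoint],
    decode_pool: list[Endpoint],
    input_tokens: int,
    expected_output_tokens: int,
) -> tuple[Endpoint, Endpoint]:
    """Assign one request to a prefill endpoint and a decode endpoint.

    Load is counted in tokens rather than requests, so one very long
    prompt weighs more than several short ones.
    """
    prefill = pick_least_loaded(prefill_pool)
    decode = pick_least_loaded(decode_pool)
    prefill.inflight_tokens += input_tokens
    decode.inflight_tokens += expected_output_tokens
    return prefill, decode


def finish(endpoint: Endpoint, tokens: int) -> None:
    """Release tokens from an endpoint's estimate when its work completes."""
    endpoint.inflight_tokens = max(0, endpoint.inflight_tokens - tokens)


if __name__ == "__main__":
    prefill_pool = [Endpoint("prefill-0", 90_000), Endpoint("prefill-1", 12_000)]
    decode_pool = [Endpoint("decode-0", 400), Endpoint("decode-1", 2_500)]
    p, d = route(prefill_pool, decode_pool, input_tokens=30_000, expected_output_tokens=800)
    print(p.name, d.name)
```

The point of counting tokens instead of requests is that agent traffic is skewed: a balancer that only counts connections would treat a 100,000-token prompt and a 1,000-token prompt as equal work.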

Prompt caching is the second lever. Cloudflare uses an x-session-affinity header to route requests toward a region that has already computed the input tensors for a session. After working with heavy internal users, the company says input-token cache hit ratios rose from 60% to 80% during peak times. That matters because even a small miss rate in long-context agent sessions can translate into a material number of extra GPUs.
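The effect of the hit-ratio change is easier to see as arithmetic on the miss side. The sketch below assumes a hypothetical one million input tokens of peak traffic; only the 60% and 80% ratios come from the post.

```python
def uncached_input_tokens(total_input_tokens: int, hit_ratio: float) -> float:
    """Input tokens that miss the prompt cache and must be prefilled again."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return total_input_tokens * (1.0 - hit_ratio)


if __name__ == "__main__":
    total = 1_000_000  # hypothetical traffic volume, not a figure from the post
    before = uncached_input_tokens(total, 0.60)
    after = uncached_input_tokens(total, 0.80)
    print(f"before: {before:,.0f} uncached, after: {after:,.0f} uncached")
    print(f"prefill work remaining: {after / before:.0%}")
```

Moving from a 60% to an 80% hit ratio cuts the miss rate from 40% to 20%, so the volume of input tokens that must be recomputed is roughly halved for the same traffic.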

The post also details Infire, Cloudflare’s Rust inference engine. Kimi K2.5 is described as having over 1 trillion parameters and about 560 GB of model weights, requiring at least 8 H100s before extra KV-cache needs are counted. Cloudflare says Infire can run Llama 4 Scout on two H200 GPUs with more than 56 GiB left for KV-cache, enough for more than 1.2 million tokens, and can run Kimi K2.5 on 8 H100 GPUs with more than 30 GiB left for KV-cache. The company also says its largest models can begin serving requests in under 20 seconds and that Infire can deliver up to 20% higher tokens-per-second throughput on unconstrained systems.
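The "at least 8 H100s" figure follows from simple capacity arithmetic. A minimal sketch, assuming the 80 GB H100 variant (the post as summarized here does not name the variant) and decimal gigabytes throughout:

```python
import math

H100_MEMORY_GB = 80    # assumption: 80 GB H100 variant
KIMI_WEIGHTS_GB = 560  # "about 560 GB" of weights, per the post


def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory can hold the weights at all."""
    return math.ceil(weights_gb / gpu_memory_gb)


def nominal_headroom_gb(weights_gb: float, gpu_memory_gb: float, gpu_count: int) -> float:
    """Memory left after weights, before activations and runtime buffers."""
    return gpu_count * gpu_memory_gb - weights_gb


if __name__ == "__main__":
    n = min_gpus_for_weights(KIMI_WEIGHTS_GB, H100_MEMORY_GB)
    print(n, nominal_headroom_gb(KIMI_WEIGHTS_GB, H100_MEMORY_GB, n))
    print(8, nominal_headroom_gb(KIMI_WEIGHTS_GB, H100_MEMORY_GB, 8))
```

Under these assumptions, seven cards would hold the weights with nothing to spare, so the eighth GPU supplies all of the headroom. Activations and runtime buffers also draw on that nominal 80 GB, which would be consistent with the KV-cache budget Cloudflare cites being smaller than the nominal figure.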

Related Articles

Cloudflare brings Kimi K2.5 to Workers AI and says agent coding reviews cut costs by 77%

Cloudflare brings Kimi K2.5 to Workers AI and shows how it cut internal agent costs

Cloudflare brings Kimi K2.5 to Workers AI and tunes the stack for agents
