Cloudflare cuts Kimi K2.5 token latency to 20-30 ms
Original: Building the foundation for running extra-large language models
Cloudflare’s latest LLM infrastructure post asks a more useful question than “which model is available?” It asks what has to change in the serving stack when trillion-parameter models become the engine for agent workloads. In a technical post dated April 16, 2026, Cloudflare says it has made Kimi K2.5 on Workers AI 3x faster.
The headline metric is latency. Cloudflare says that after moving traffic to a prefill/decode-disaggregated architecture, p90 Time to First Token improved even as request volume increased and the GPU count stayed the same. The sharper number is p90 time per token: it moved from roughly 100 ms with high variance to 20-30 ms. For interactive agents and coding assistants, that kind of inter-token latency change is not abstract infrastructure trivia; it determines whether the product feels responsive.
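A back-of-the-envelope calculation shows what that per-token change means for a streamed response. The latency figures come from the post; the helper function and the 500-token response length are illustrative:

```python
def stream_time_s(output_tokens: int, ms_per_token: float, ttft_ms: float = 0.0) -> float:
    """Wall-clock seconds to stream a full response at a given per-token latency."""
    return (ttft_ms + output_tokens * ms_per_token) / 1000.0

# A 500-token reply at the old ~100 ms/token vs. the new 20-30 ms band:
print(stream_time_s(500, 100))  # 50.0 s
print(stream_time_s(500, 20))   # 10.0 s
print(stream_time_s(500, 30))   # 15.0 s
```

At 20-30 ms per token the model emits 33-50 tokens per second, comfortably faster than most users read, which is why the change lands as "responsive" rather than as a benchmark footnote.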
The architecture is built around the shape of agent traffic. Agents accumulate system prompts, tools, MCPs, previous messages and generated code. Each turn can send a large amount of input context before the model generates the next output. Cloudflare says Workers AI therefore focused on fast input token processing and fast tool calling. Prefill and decode run on separate server pools, and token-aware load balancing estimates in-flight prefill and decode tokens across endpoints to spread work more evenly.
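A minimal sketch of what token-aware load balancing could look like, assuming the balancer tracks estimated in-flight prefill and decode tokens per endpoint. The field names and the decode weighting are assumptions for illustration, not Cloudflare's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    inflight_prefill_tokens: int = 0
    inflight_decode_tokens: int = 0

    def load(self, prefill_weight: float = 1.0, decode_weight: float = 4.0) -> float:
        # Decode tokens are weighted more heavily here because each one
        # occupies the GPU for a separate forward pass, while prefill
        # processes its tokens in one batched pass.
        return (self.inflight_prefill_tokens * prefill_weight
                + self.inflight_decode_tokens * decode_weight)

def pick_endpoint(endpoints: list[Endpoint], request_input_tokens: int) -> Endpoint:
    """Route to the endpoint with the lowest estimated in-flight token load."""
    best = min(endpoints, key=lambda e: e.load())
    best.inflight_prefill_tokens += request_input_tokens
    return best
```

The point of estimating tokens rather than counting requests is that agent requests are wildly uneven in size: one request carrying 100k tokens of accumulated context costs far more prefill work than a hundred short chat turns.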
Prompt caching is the second lever. Cloudflare uses an x-session-affinity header to route requests toward a region that already holds the computed input tensors for a session. After working with heavy internal users, the company says input-token cache hit ratios rose from 60% to 80% during peak times. That matters because even a small miss rate in long-context agent sessions can turn into a material number of extra GPUs.
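The post names the x-session-affinity header; one plausible sketch of affinity routing hashes that header to a consistent region so a session keeps landing where its input tensors are already cached. The hashing scheme and region codes here are assumptions, not Cloudflare's implementation:

```python
import hashlib

REGIONS = ["iad", "fra", "sin"]  # illustrative region codes

def route_region(headers: dict[str, str], regions: list[str] = REGIONS) -> str:
    """Prefer the region that should already hold this session's prompt cache."""
    session = headers.get("x-session-affinity")
    if session is None:
        return regions[0]  # no affinity header: fall back to default routing
    digest = hashlib.sha256(session.encode()).digest()
    return regions[int.from_bytes(digest[:4], "big") % len(regions)]
```

The property that matters is determinism: every request carrying the same session value resolves to the same region, so the cache hit ratio depends on session stickiness rather than luck.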
The post also details Infire, Cloudflare’s Rust inference engine. Kimi K2.5 is described as over 1 trillion parameters with about 560 GB of model weights, requiring at least 8 H100s before extra KV-cache needs are counted. Cloudflare says Infire can run Llama 4 Scout on two H200 GPUs with more than 56 GiB left for KV-cache, enough for more than 1.2M tokens, and can run Kimi K2.5 on eight H100 GPUs with more than 30 GiB left for KV-cache. The company also says its largest models can begin serving requests in under 20 seconds and that Infire can deliver up to 20% higher tokens-per-second throughput on unconstrained systems.
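The memory arithmetic behind "at least 8 H100s" can be sketched directly. Note that the naive headroom comes out much larger than the reported "more than 30 GiB": the gap is runtime overhead (activation buffers, CUDA context, allocator fragmentation), which this illustrative helper exposes as a parameter rather than estimating:

```python
def kv_headroom_gib(num_gpus: int, gpu_gib: float, weights_gib: float,
                    runtime_overhead_gib: float = 0.0) -> float:
    """GiB left for KV-cache after loading weights across a GPU pool."""
    return num_gpus * gpu_gib - weights_gib - runtime_overhead_gib

# Kimi K2.5: ~560 GB of weights on 8x H100 (80 GiB HBM each).
# Mind the units: 560 GB (decimal) is about 521.5 GiB (binary).
weights_gib = 560e9 / 2**30
print(round(kv_headroom_gib(8, 80, weights_gib), 1))  # ~118.5 GiB, before overhead
```

On 7 GPUs the budget goes negative (560 GiB of HBM against ~521.5 GiB of weights leaves almost nothing once any overhead is counted), which is why 8 H100s is the floor before KV-cache is even considered.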
Related Articles
HN focused on the plumbing question: does a 14-plus-provider inference layer actually make agent apps easier to operate? Cloudflare framed AI Gateway, Workers AI bindings, and a broader multimodal catalog as one platform, while commenters compared it with OpenRouter and pressed on pricing accuracy, catalog overlap, and deployment trust.
Cloudflare is trying to make model choice less sticky: AI Gateway now routes Workers AI calls to 70+ models across 12+ providers through one interface. For agent builders, the important part is not the catalog alone but spend controls, retry behavior, and failover in workflows that may chain ten inference calls for one task.
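The retry-and-failover behavior that matters in a ten-call chain can be sketched as an ordered provider list with per-provider retries and backoff. The list shape and function names are assumptions for illustration, not AI Gateway's actual API:

```python
import time

def call_with_failover(prompt: str, providers, max_retries_per: int = 2):
    """Try providers in order; retry transient failures with exponential backoff.

    `providers` is a list of (name, call_fn) pairs, where call_fn(prompt)
    returns a completion string or raises on failure.
    """
    last_err = None
    for name, call_fn in providers:
        for attempt in range(max_retries_per):
            try:
                return name, call_fn(prompt)
            except Exception as err:
                last_err = err
                time.sleep(0.1 * 2**attempt)  # back off before retrying
    raise RuntimeError(f"all providers failed: {last_err}")
```

In a workflow that chains ten inference calls, any per-call failure probability compounds, so where this logic lives (in the gateway versus in every agent's application code) is the real operational question.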
Cloudflare said on X on March 19 that Kimi K2.5 is now available on Workers AI. The launch pairs a frontier open-source model with platform features aimed at lowering latency and cost for agent workloads.