Cloudflare brings Kimi K2.5 to Workers AI and shows how it cut internal agent costs
Original: Kimi K2.5 is now available on #WorkersAI. You can now build and run agents end-to-end on the Cloudflare Developer Platform. Read about how we tuned our inference stack to drive down costs for internal agent workflows. https://cfl.re/4bmpZgb
What Cloudflare posted on X
On March 20, 2026, Cloudflare said Kimi K2.5 is now available on Workers AI, framing the release as a way to build and run agents end to end on the Cloudflare Developer Platform. The tweet also pointed to new details on how Cloudflare tuned its inference stack to reduce costs for internal agent workloads.
That framing is important. Cloudflare is not just adding another model endpoint. It is positioning Workers AI as the model layer inside a broader agent runtime that already includes Durable Objects for state, Workflows for long-running jobs, and container-based execution through Dynamic Workers or Sandbox.
What the Cloudflare blog adds
The March 19 Cloudflare blog says Workers AI is moving into the large-model tier, starting with Moonshot AI's Kimi K2.5. Cloudflare describes the model as having a 256K context window with support for multi-turn tool calling, vision inputs, and structured outputs, which makes it a better fit for agent workflows than smaller open models.
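The capabilities listed above map onto a familiar request shape. The sketch below shows what a multi-turn tool-calling request with structured output might look like; the model identifier, endpoint fields, and `lookup_cve` tool are illustrative assumptions in the style of common OpenAI-compatible chat APIs, not the confirmed Workers AI contract for Kimi K2.5.

```python
# Hypothetical chat payload for an agent turn: tool definitions plus a
# structured-output request. Field names and the model ID are assumptions.

def build_agent_request(messages, tools):
    """Assemble an OpenAI-style chat payload with tool definitions."""
    return {
        "model": "@cf/moonshotai/kimi-k2.5",  # assumed identifier
        "messages": messages,
        "tools": tools,
        "response_format": {"type": "json_object"},  # structured output
    }

# A hypothetical tool an agent might expose during a security review.
lookup_tool = {
    "type": "function",
    "function": {
        "name": "lookup_cve",
        "description": "Fetch details for a CVE identifier.",
        "parameters": {
            "type": "object",
            "properties": {"cve_id": {"type": "string"}},
            "required": ["cve_id"],
        },
    },
}

payload = build_agent_request(
    messages=[{"role": "user", "content": "Review this diff for injection bugs."}],
    tools=[lookup_tool],
)
```

The large context window matters here because each agent turn replays the accumulated conversation and tool results, so the payload grows with every round trip.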
The most concrete part of the post is Cloudflare's internal usage data. The company says engineers already use Kimi inside OpenCode for agentic coding tasks and inside an automated code-review workflow exposed publicly through the Bonk agent on Cloudflare repositories. In one security-review use case, Cloudflare says the system processes more than 7B tokens per day, has found more than 15 confirmed issues in a single codebase, and would have cost about $2.4M per year on a mid-tier proprietary model. After the switch to Workers AI, Cloudflare says, the same workload costs 77% less.
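Those figures can be sanity-checked with quick arithmetic. The $2.4M baseline, the 77% reduction, and the 7B tokens/day volume come from Cloudflare's post; everything else below follows from them, and the implied per-token rate assumes the $2.4M estimate covers that same 7B-tokens/day workload.

```python
# Back-of-the-envelope check on Cloudflare's reported savings.
baseline_annual_cost = 2_400_000   # ~$2.4M/yr on a mid-tier proprietary model
savings_fraction = 0.77            # reported reduction after moving to Workers AI

workers_ai_cost = baseline_annual_cost * (1 - savings_fraction)
annual_savings = baseline_annual_cost - workers_ai_cost

# Implied blended rate, assuming the baseline prices the 7B-tokens/day workload.
annual_tokens_millions = 7_000 * 365                      # 7B/day in millions
implied_rate = baseline_annual_cost / annual_tokens_millions

print(f"Workers AI cost:  ~${workers_ai_cost:,.0f}/yr")   # ~$552,000/yr
print(f"Annual savings:   ~${annual_savings:,.0f}/yr")    # ~$1,848,000/yr
print(f"Implied baseline: ~${implied_rate:.2f} per 1M tokens")
```

The implied rate of roughly $0.94 per million tokens is a blended figure across input, cached, and output tokens, which is why it sits below typical mid-tier list prices for fresh input tokens.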
Cloudflare also paired the launch with platform changes for agent traffic. It now surfaces cached tokens as a usage metric, discounts cached tokens relative to fresh input tokens, adds an x-session-affinity header to improve prefix-cache hit rates, and revamps its asynchronous API for durable high-volume jobs such as research or code-scanning agents.
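The session-affinity mechanism only pays off if a client reuses the same value across an agent's turns, so repeated calls land on a cache-warm backend. A minimal sketch of the client side, assuming the standard bearer-token REST pattern; the `x-session-affinity` header name is from the announcement, while the value format (one UUID per agent session) is an assumption:

```python
import uuid

def build_headers(api_token: str, session_id: str) -> dict:
    """Headers for a Workers AI request that opts into session affinity.

    Keeping session_id stable across an agent's turns should improve
    prefix-cache hit rates; rotating it per request forfeits the
    cached-token discount on the shared conversation prefix.
    """
    return {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
        "x-session-affinity": session_id,  # stable for the whole session
    }

# Mint one ID per long-running agent session and reuse it on every call.
session_id = str(uuid.uuid4())
headers = build_headers("YOUR_API_TOKEN", session_id)
```

The design choice mirrors sticky sessions in load balancing: the header gives the router a cheap, client-controlled key for steering a session's requests toward the replica that already holds its prefix cache.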
Why this matters
The bigger signal is economic, not just technical. As teams move from occasional prompt calls to always-on coding, search, and security agents, inference cost becomes a scaling constraint long before model availability does. Cloudflare is arguing that open large models plus platform-level serving optimizations can close enough of the capability gap to make agents financially viable at higher volume.
If that claim holds, the competitive battleground shifts toward infrastructure: cache behavior, async execution, throughput tuning, and integration with the rest of the runtime. In other words, model hosting is becoming inseparable from agent-platform design.
Sources: Cloudflare X post · Cloudflare blog