Responses API WebSockets cut agent loop latency by up to 40%


LLM · Apr 30, 2026 · By Insights AI

What the developer tweet actually changed

OpenAI’s developer account did not just promote a transport tweak. It pointed to a deeper shift in agent infrastructure: inference is no longer the only bottleneck. The tweet says WebSockets in the Responses API keep state warm across tool calls and make workflows up to 40% faster end to end. That matters because once coding agents start reading files, running tests, and looping through tools, small bits of API overhead can add up to minutes of waiting.

“WebSockets keep response state warm across tool calls, helping workflows run up to 40% faster.”
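A quick back-of-envelope calculation shows why that overhead compounds. The per-call figures below are illustrative assumptions, not numbers from OpenAI's post:

```python
# Illustrative math: per-request overhead across a long agent loop.
# The overhead and call-count figures are assumptions, not measured values.

TOOL_CALLS = 200          # tool round trips in a long coding-agent session
HTTP_OVERHEAD_S = 0.3     # assumed per-request cost: connection setup, validation, routing, history re-processing
WS_OVERHEAD_S = 0.02      # assumed per-message cost on a warm, persistent connection

http_total = TOOL_CALLS * HTTP_OVERHEAD_S  # 60 seconds of pure overhead
ws_total = TOOL_CALLS * WS_OVERHEAD_S      # 4 seconds
print(f"HTTP overhead: {http_total:.0f}s, WebSocket overhead: {ws_total:.0f}s")
```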

OpenAI’s engineering post fills in the scale of the problem. Previous flagship models on the Responses API ran around 65 tokens per second, but GPT-5.3-Codex-Spark pushed the target to nearly 1,000 tokens per second and exposed the overhead of repeated request validation, routing, and history processing. OpenAI says the fix was to keep a persistent connection, reuse previous response state in memory, and avoid redoing work on every tool round trip. The article also says the system hit bursts of up to 4,000 tokens per second in production traffic after the WebSocket mode rollout.
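As a rough sketch of what a persistent-connection loop looks like, here is a minimal client using Python's `websockets` library. The endpoint URL, message fields, and `previous_response_id` handling are assumptions for illustration, not OpenAI's documented WebSocket protocol:

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

# Hypothetical endpoint and message shapes -- illustrative only,
# not OpenAI's actual Responses API WebSocket protocol.
WS_URL = "wss://api.example.com/v1/responses/ws"

async def agent_loop(tool_results: list[str]) -> None:
    # One TCP/TLS handshake for the whole session; every tool round trip
    # afterward reuses the warm connection instead of opening a new request.
    async with websockets.connect(WS_URL) as ws:
        await ws.send(json.dumps({"type": "create", "input": "Fix the failing test"}))
        response = json.loads(await ws.recv())

        for result in tool_results:
            # Send only the new tool output; the server keeps prior response
            # state in memory rather than re-validating the full history.
            await ws.send(json.dumps({
                "type": "continue",
                "previous_response_id": response.get("id"),  # assumed field name
                "tool_output": result,
            }))
            response = json.loads(await ws.recv())

asyncio.run(agent_loop(["pytest: 3 failed", "pytest: all passed"]))
```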

Why builders should care

The strongest proof is downstream adoption. OpenAI says Codex moved most of its Responses API traffic onto WebSockets quickly, while Vercel saw up to 40% lower latency, Cline reported 39% faster multi-file workflows, and Cursor gained up to 30% speedups on OpenAI models. Those numbers turn a protocol choice into a product decision. A faster loop means less idle time between model action and tool result, which is exactly where agent systems tend to feel sluggish.

The OpenAIDevs account usually posts changes that affect builders directly, not vague roadmap talk, so this signal is operational rather than aspirational. What to watch next is how quickly more agent frameworks adopt persistent connections by default and whether similar latency work shows up around other tool-heavy surfaces such as browser automation and computer use. As inference speeds keep rising, the winning stacks will be the ones that stop wasting that speed everywhere else.

Source: OpenAIDevs source tweet · OpenAI engineering post
