Responses API WebSockets make OpenAI agent loops up to 40% faster

Original: Speeding up agentic workflows with WebSockets in the Responses API

LLM · Apr 23, 2026 · By Insights AI · 2 min read

Agent benchmarks often hide the dull part that users actually feel: waiting for the plumbing between tool calls. OpenAI’s April 22 engineering post argues that by early 2026, inference was no longer the slowest step in many coding agents. Once models got fast enough, the API layer itself became the drag. That is why this release matters. It is not a new model, but a transport change that cuts the dead time around the model.

OpenAI says agent loops using the Responses API are now up to 40% faster end-to-end. The company frames the change against a hard internal target: older flagship models such as GPT-5 and GPT-5.2 ran at roughly 65 tokens per second, while GPT-5.3-Codex-Spark was supposed to push beyond 1,000 tokens per second on specialized Cerebras hardware. At that speed, repeatedly rebuilding conversation state, validating the same context, and reopening the same request path stopped being a rounding error. It became the bottleneck.
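A quick back-of-the-envelope calculation shows why fixed overhead "stopped being a rounding error" at higher throughput. The 0.3-second per-turn overhead figure below is a hypothetical assumption for illustration, not a number from the post; only the 65 and ~1,000 tokens-per-second figures come from the article.

```python
# Illustrative arithmetic: what share of each agent turn is spent on fixed
# request plumbing (connection setup, context revalidation) as model
# throughput rises. The 0.3 s overhead and 500-token turn are assumptions.

def overhead_share(tokens: int, tps: float, overhead_s: float) -> float:
    """Fraction of total turn time spent outside inference."""
    inference_s = tokens / tps
    return overhead_s / (inference_s + overhead_s)

# ~65 tok/s era vs. the ~1,000 tok/s target named in the post.
slow = overhead_share(500, 65, 0.3)
fast = overhead_share(500, 1000, 0.3)
print(f"overhead share at    65 tok/s: {slow:.0%}")   # a few percent
print(f"overhead share at 1,000 tok/s: {fast:.0%}")   # dominant cost
```

At 65 tokens per second the assumed overhead is noise next to ~7.7 seconds of generation; at 1,000 tokens per second the same overhead eats more than a third of the turn, which is the bottleneck the transport change targets.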

The fix was to stop treating every turn like a fresh request. OpenAI added a WebSocket mode so clients can keep a persistent connection to the Responses API while the server holds a connection-scoped in-memory cache of prior response state. Follow-up calls still use the familiar response.create pattern, but when the client passes previous_response_id, the server can reuse the earlier response object, prior input and output items, tool definitions, namespaces, and even rendered-token artifacts instead of rebuilding the full history from scratch. Before that redesign, the company says it had already squeezed out a nearly 45% improvement in time to first token through smaller optimizations such as caching rendered tokens, trimming network hops, and speeding up parts of the safety stack. WebSockets were the structural step that moved the ceiling higher.

The interesting part is how quickly the gains showed up in tools people already use. OpenAI says Codex shifted the majority of its Responses API traffic onto WebSocket mode, Vercel saw up to 40% lower latency after integrating it into the AI SDK, Cline's multi-file workflows became 39% faster, and OpenAI models in Cursor became up to 30% faster. For GPT-5.3-Codex-Spark, OpenAI says production traffic hit the 1,000-tokens-per-second target and burst to 4,000. In practical terms, the next phase of agent competition may depend less on raw model quality and more on who can keep the surrounding stack from wasting the model's speed.




© 2026 Insights. All rights reserved.