Responses API WebSockets make OpenAI agent loops up to 40% faster
Original: Speeding up agentic workflows with WebSockets in the Responses API View original →
Agent benchmarks often hide the dull part that users actually feel: waiting for the plumbing between tool calls. OpenAI’s April 22 engineering post argues that by early 2026, inference was no longer the slowest step in many coding agents. Once models got fast enough, the API layer itself became the drag. That is why this release matters. It is not a new model, but a transport change that cuts the dead time around the model.
OpenAI says agent loops using the Responses API are now up to 40% faster end-to-end. The company frames the change against a hard internal target: older flagship models such as GPT-5 and GPT-5.2 ran at roughly 65 tokens per second, while GPT-5.3-Codex-Spark was supposed to push beyond 1,000 tokens per second on specialized Cerebras hardware. At that speed, repeatedly rebuilding conversation state, validating the same context, and reopening the same request path stopped being a rounding error. It became the bottleneck.
The fix was to stop treating every turn like a fresh request. OpenAI added WebSocket mode so clients can keep a persistent connection to the Responses API while the server holds a connection-scoped in-memory cache of prior response state. Follow-up calls still use the familiar response.create pattern, but when the client passes previous_response_id, the server can reuse the earlier response object, prior input and output items, tool definitions, namespaces, and even rendered-token artifacts instead of rebuilding the full history from scratch. Before that redesign, the company says it had already squeezed nearly 45% improvement in time to first token through smaller optimizations such as caching rendered tokens, trimming network hops, and speeding parts of the safety stack. WebSockets were the structural step that moved the ceiling higher.
The interesting part is how quickly the gains showed up in tools people already use. OpenAI says Codex shifted the majority of its Responses API traffic onto WebSocket mode, Vercel saw up to 40% lower latency after integrating it into the AI SDK, Cline’s multi-file workflows became 39% faster, and OpenAI models in Cursor became up to 30% faster. For GPT-5.3-Codex-Spark, OpenAI says production traffic hit the 1,000-TPS target and burst to 4,000 TPS. In practical terms, that means the next phase of agent competition may depend less on raw model quality alone and more on who can keep the surrounding stack from wasting the model’s speed.
Related Articles
Why it matters: faster models stop feeling fast if orchestration overhead eats the gain. OpenAI says WebSocket mode made agent workflows up to 40% faster end to end, while lifting effective inference speed from about 65 to nearly 1,000 tokens per second.
OpenAI Developers published a March 11, 2026 engineering write-up explaining how the Responses API uses a hosted computer environment for long-running agent workflows. The post centers on shell execution, hosted containers, controlled network access, reusable skills, and native compaction for context management.
OpenAI Developers said recent Codex usage data suggests developers are handing off long-running work like refactors and architecture planning at the end of the day. In a follow-up reply, the account said tasks started at 11 pm are 60% more likely than other tasks to run for 3+ hours.