Responses API WebSockets make OpenAI agent loops up to 40% faster

Original: Speeding up agentic workflows with WebSockets in the Responses API

LLM · Apr 23, 2026 · By Insights AI · 2 min read

Agent benchmarks often hide the dull part that users actually feel: waiting for the plumbing between tool calls. OpenAI’s April 22 engineering post argues that by early 2026, inference was no longer the slowest step in many coding agents. Once models got fast enough, the API layer itself became the drag. That is why this release matters. It is not a new model, but a transport change that cuts the dead time around the model.

OpenAI says agent loops using the Responses API are now up to 40% faster end-to-end. The company frames the change against a hard internal target: older flagship models such as GPT-5 and GPT-5.2 ran at roughly 65 tokens per second, while GPT-5.3-Codex-Spark was supposed to push beyond 1,000 tokens per second on specialized Cerebras hardware. At that speed, repeatedly rebuilding conversation state, validating the same context, and reopening the same request path stopped being a rounding error. It became the bottleneck.
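A quick back-of-the-envelope calculation shows why fixed overhead "stopped being a rounding error" at higher throughput. The 0.3-second per-turn overhead figure below is a hypothetical assumption for illustration, not a number from the post; only the 65 and ~1,000 tokens-per-second figures come from the article.

```python
# Illustrative arithmetic: what share of each agent turn is spent on fixed
# request plumbing (connection setup, context revalidation) as model
# throughput rises. The 0.3 s overhead and 500-token turn are assumptions.

def overhead_share(tokens: int, tps: float, overhead_s: float) -> float:
    """Fraction of total turn time spent outside inference."""
    inference_s = tokens / tps
    return overhead_s / (inference_s + overhead_s)

# ~65 tok/s era vs. the ~1,000 tok/s target named in the post.
slow = overhead_share(500, 65, 0.3)
fast = overhead_share(500, 1000, 0.3)
print(f"overhead share at    65 tok/s: {slow:.0%}")   # a few percent
print(f"overhead share at 1,000 tok/s: {fast:.0%}")   # dominant cost
```

At 65 tokens per second the assumed overhead is noise next to ~7.7 seconds of generation; at 1,000 tokens per second the same overhead eats more than a third of the turn, which is the bottleneck the transport change targets.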

The fix was to stop treating every turn like a fresh request. OpenAI added a WebSocket mode so clients can keep a persistent connection to the Responses API while the server holds a connection-scoped in-memory cache of prior response state. Follow-up calls still use the familiar response.create pattern, but when the client passes previous_response_id, the server can reuse the earlier response object, prior input and output items, tool definitions, namespaces, and even rendered-token artifacts instead of rebuilding the full history from scratch. Before that redesign, the company says it had already squeezed out a nearly 45% improvement in time to first token through smaller optimizations such as caching rendered tokens, trimming network hops, and speeding up parts of the safety stack. WebSockets were the structural step that moved the ceiling higher.

The interesting part is how quickly the gains showed up in tools people already use. OpenAI says Codex shifted the majority of its Responses API traffic onto WebSocket mode, Vercel saw up to 40% lower latency after integrating it into the AI SDK, Cline's multi-file workflows became 39% faster, and OpenAI models in Cursor became up to 30% faster. For GPT-5.3-Codex-Spark, OpenAI says production traffic hit the 1,000-tokens-per-second target and burst to 4,000. In practical terms, the next phase of agent competition may depend less on raw model quality and more on who can keep the surrounding stack from wasting the model's speed.




© 2026 Insights. All rights reserved.