OpenAI and Perplexity share production lessons from scaling voice agents with the Realtime API

Original post: "📣 Lessons from building voice agents at scale. @perplexity_ai breaks down how running voice with the Realtime API in production shaped their approach to context, audio pipelines, and turn-taking in real-world environments. developers.openai.com/blog/r…"

LLM · Mar 30, 2026 · By Insights AI · 2 min read

What OpenAI and Perplexity described

OpenAI Developers said on March 30, 2026 that Perplexity has published a case study on building voice agents at scale with the Realtime API. In the official write-up, OpenAI and Perplexity say Perplexity uses Realtime-1.5 in production across products such as Perplexity Comet and Perplexity Computer, and that the company now manages millions of voice sessions every month. The article frames voice as a core interface, not a side feature, because users want to hand off work conversationally and watch an agent complete it.

The most valuable part of the post is that it focuses on operational lessons instead of launch marketing. Perplexity explains that the hard part was not just getting speech in and audio out. It was making long-running, tool-using voice agents stay stable when context grows, clients send different native audio buffers, and users speak in noisy environments with interruptions, hesitations, and mid-task corrections.

What changed in production

One concrete lesson involved context management. Perplexity says its early approach tried to push large transcript updates, but that failed in an all-or-nothing way. If a 10,000-token update arrived when the model had room for only 5,000 more tokens, the system could lose all prior history at once. The team changed course and began feeding context in 2,000-token chunks instead, accepting some overhead in exchange for more graceful truncation and more stable interactions.
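The chunking strategy described above can be sketched in a few lines. This is a hedged illustration, not Perplexity's implementation: the helper names are invented, and token counts are approximated by word count where a real system would use the model's tokenizer.

```python
# Sketch: feed context in ~2,000-token chunks so that hitting the budget
# drops only the tail of an update, not the entire prior history.
# All names here are illustrative, not Perplexity's actual code.

CHUNK_TOKENS = 2_000

def count_tokens(text: str) -> int:
    """Crude token estimate; stand-in for a real tokenizer."""
    return len(text.split())

def chunk_transcript(text: str, chunk_tokens: int = CHUNK_TOKENS) -> list[str]:
    """Split a large transcript update into ~chunk_tokens-sized pieces."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_tokens])
        for i in range(0, len(words), chunk_tokens)
    ]

def push_context(chunks: list[str], budget_tokens: int) -> tuple[list[str], list[str]]:
    """Send chunks until the context budget runs out.

    Returns (sent, deferred). With a monolithic 10,000-token update and a
    5,000-token budget, everything fails at once; with 2,000-token chunks,
    the first chunks land and only the remainder is deferred or truncated.
    """
    sent, deferred, used = [], [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            deferred.append(chunk)
        else:
            sent.append(chunk)
            used += cost
    return sent, deferred
```

The trade-off matches the article's framing: per-chunk overhead goes up, but truncation becomes gradual instead of all-or-nothing.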

The post also emphasizes that conversation semantics matter as much as raw token count. Perplexity found that if too much browsing context was inserted as user input, the model behaved as if the user had literally spoken every page fragment out loud. If too much was inserted as system messages, the model blurred the line between its inherent knowledge, the provided context, and the actual question. The team says getting these roles right was essential for making voice interactions feel natural.
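One way to picture the role-labeling problem is to keep browsing context clearly marked as reference material, separate from both the standing instructions and the user's actual words. This is a minimal sketch using generic chat-message roles; the Realtime API's conversation items differ in shape, and `build_messages` is an illustrative helper, not Perplexity's code.

```python
# Sketch: separate durable instructions, labeled reference context, and
# the actual spoken question so the model does not treat page fragments
# as user speech. Role layout is illustrative, not the API's exact schema.

def build_messages(page_context: str, spoken_question: str) -> list[dict]:
    return [
        # Durable behavior lives in the base instructions only.
        {"role": "system",
         "content": "You are a voice assistant. Reference context is "
                    "background material, not something the user said aloud."},
        # Browsing context goes in as clearly labeled reference text.
        {"role": "system",
         "content": f"[Reference context]\n{page_context}"},
        # Only the words the user actually spoke are labeled as user input.
        {"role": "user", "content": spoken_question},
    ]
```

The point of the separation is exactly the failure mode the article describes: context labeled as user input reads as speech, while unlabeled system context bleeds into the model's sense of what it inherently knows.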

A separate lesson came from audio infrastructure. Perplexity operates across clients built in Swift, TypeScript, Rust, and C++, and the company says inconsistent native audio buffers created uneven performance. Standardizing audio across product surfaces reduced that mismatch. The article also stresses the need to tune for messy real-world environments, where the model has to handle interruptions, background noise, and turn-taking without losing responsiveness.
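Standardizing audio usually means converging every client's native buffer on one canonical format at the boundary. As a hedged sketch, assume the canonical format is 16-bit mono PCM (a common Realtime audio format) and that one client hands over interleaved float32 stereo; the helper below is illustrative, not Perplexity's pipeline, and real code would also resample to the target rate.

```python
# Sketch: normalize a float32 interleaved stereo buffer to 16-bit mono
# PCM. Clients in Swift, TypeScript, Rust, and C++ deliver different
# native formats; one conversion at the boundary removes the mismatch.
import struct

def f32_stereo_to_pcm16_mono(buf: bytes) -> bytes:
    """Downmix interleaved float32 stereo to mono and quantize to int16."""
    n = len(buf) // 4  # number of float32 samples
    floats = struct.unpack(f"<{n}f", buf)
    out = []
    for i in range(0, n, 2):  # average each L/R pair to mono
        mono = (floats[i] + floats[i + 1]) / 2.0
        mono = max(-1.0, min(1.0, mono))  # clamp before quantizing
        out.append(int(mono * 32767))
    return struct.pack(f"<{len(out)}h", *out)
```

Keeping this conversion in one place, rather than in each client, is what makes performance uniform across product surfaces.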

Why it matters

This case study matters because it shows where production voice agents are actually fragile. The bottleneck is not only model quality. It is context management, message semantics, audio normalization, and interaction design under imperfect conditions. Those are the engineering details that turn a good voice demo into a voice system people can trust for regular use.

For developers building agent products, the broader takeaway is that voice is becoming infrastructure. Once teams operate at the scale Perplexity describes, design choices like chunk size, role labeling, and tool selection stop being implementation trivia and become product-defining architecture. That is a useful reality check for any company treating real-time multimodal agents as the next major interface layer.


Related Articles


OpenAI said on X on March 17, 2026 that GPT-5.4 mini was available in ChatGPT, Codex, and the API. The launch positions mini as a faster coding and multimodal workhorse, while OpenAI’s accompanying post also introduces GPT-5.4 nano for cheaper API-only workloads.


Google DeepMind said on March 26, 2026 that Gemini 3.1 Flash Live is rolling out in preview via the Live API in Google AI Studio. Google’s blog says the model is designed for real-time voice and vision agents, improves tool triggering in noisy environments, and supports more than 90 languages for multimodal conversations.


OpenAI Devs said on March 26, 2026 that plugins are rolling out in Codex, letting the agent work with common tools such as Slack, Figma, Notion, and Gmail. OpenAI's Codex docs describe plugins as reusable bundles that package skills, app integrations, and MCP server settings, turning Codex into a more shareable workflow layer for teams.


© 2026 Insights. All rights reserved.