OpenAI and Perplexity share production lessons from scaling voice agents with the Realtime API

Original post: "📣 Lessons from building voice agents at scale. @perplexity_ai breaks down how running voice with the Realtime API in production shaped their approach to context, audio pipelines, and turn-taking in real-world environments. developers.openai.com/blog/r…"

LLM · Mar 30, 2026 · By Insights AI · 2 min read

What OpenAI and Perplexity described

OpenAI Developers said on March 30, 2026 that Perplexity has published a case study on building voice agents at scale with the Realtime API. In the official write-up, OpenAI and Perplexity say Perplexity uses Realtime-1.5 in production across products such as Perplexity Comet and Perplexity Computer, and that the company now manages millions of voice sessions every month. The article frames voice as a core interface, not a side feature, because users want to hand off work conversationally and watch an agent complete it.

The most valuable part of the post is that it focuses on operational lessons instead of launch marketing. Perplexity explains that the hard part was not just getting speech in and audio out. It was making long-running, tool-using voice agents stay stable when context grows, clients send different native audio buffers, and users speak in noisy environments with interruptions, hesitations, and mid-task corrections.

What changed in production

One concrete lesson involved context management. Perplexity says its early approach tried to push large transcript updates, but that failed in an all-or-nothing way. If a 10,000-token update arrived when the model had room for only 5,000 more tokens, the system could lose all prior history at once. The team changed course and began feeding context in 2,000-token chunks instead, accepting some overhead in exchange for more graceful truncation and more stable interactions.
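The chunking strategy described above can be sketched in a few lines. This is a hedged illustration, not Perplexity's implementation: the helper names are invented, and token counts are approximated by word count where a real system would use the model's tokenizer.

```python
# Sketch: feed context in ~2,000-token chunks so that hitting the budget
# drops only the tail of an update, not the entire prior history.
# All names here are illustrative, not Perplexity's actual code.

CHUNK_TOKENS = 2_000

def count_tokens(text: str) -> int:
    """Crude token estimate; stand-in for a real tokenizer."""
    return len(text.split())

def chunk_transcript(text: str, chunk_tokens: int = CHUNK_TOKENS) -> list[str]:
    """Split a large transcript update into ~chunk_tokens-sized pieces."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_tokens])
        for i in range(0, len(words), chunk_tokens)
    ]

def push_context(chunks: list[str], budget_tokens: int) -> tuple[list[str], list[str]]:
    """Send chunks until the context budget runs out.

    Returns (sent, deferred). With a monolithic 10,000-token update and a
    5,000-token budget, everything fails at once; with 2,000-token chunks,
    the first chunks land and only the remainder is deferred or truncated.
    """
    sent, deferred, used = [], [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            deferred.append(chunk)
        else:
            sent.append(chunk)
            used += cost
    return sent, deferred
```

The trade-off matches the article's framing: per-chunk overhead goes up, but truncation becomes gradual instead of all-or-nothing.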

The post also emphasizes that conversation semantics matter as much as raw token count. Perplexity found that if too much browsing context was inserted as user input, the model behaved as if the user had literally spoken every page fragment out loud. If too much was inserted as system messages, the model blurred the line between its inherent knowledge, the provided context, and the actual question. The team says getting these roles right was essential for making voice interactions feel natural.
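One way to picture the role-labeling problem is to keep browsing context clearly marked as reference material, separate from both the standing instructions and the user's actual words. This is a minimal sketch using generic chat-message roles; the Realtime API's conversation items differ in shape, and `build_messages` is an illustrative helper, not Perplexity's code.

```python
# Sketch: separate durable instructions, labeled reference context, and
# the actual spoken question so the model does not treat page fragments
# as user speech. Role layout is illustrative, not the API's exact schema.

def build_messages(page_context: str, spoken_question: str) -> list[dict]:
    return [
        # Durable behavior lives in the base instructions only.
        {"role": "system",
         "content": "You are a voice assistant. Reference context is "
                    "background material, not something the user said aloud."},
        # Browsing context goes in as clearly labeled reference text.
        {"role": "system",
         "content": f"[Reference context]\n{page_context}"},
        # Only the words the user actually spoke are labeled as user input.
        {"role": "user", "content": spoken_question},
    ]
```

The point of the separation is exactly the failure mode the article describes: context labeled as user input reads as speech, while unlabeled system context bleeds into the model's sense of what it inherently knows.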

A separate lesson came from audio infrastructure. Perplexity operates across clients built in Swift, TypeScript, Rust, and C++, and the company says inconsistent native audio buffers created uneven performance. Standardizing audio across product surfaces reduced that mismatch. The article also stresses the need to tune for messy real-world environments, where the model has to handle interruptions, background noise, and turn-taking without losing responsiveness.
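Standardizing audio usually means converging every client's native buffer on one canonical format at the boundary. As a hedged sketch, assume the canonical format is 16-bit mono PCM (a common Realtime audio format) and that one client hands over interleaved float32 stereo; the helper below is illustrative, not Perplexity's pipeline, and real code would also resample to the target rate.

```python
# Sketch: normalize a float32 interleaved stereo buffer to 16-bit mono
# PCM. Clients in Swift, TypeScript, Rust, and C++ deliver different
# native formats; one conversion at the boundary removes the mismatch.
import struct

def f32_stereo_to_pcm16_mono(buf: bytes) -> bytes:
    """Downmix interleaved float32 stereo to mono and quantize to int16."""
    n = len(buf) // 4  # number of float32 samples
    floats = struct.unpack(f"<{n}f", buf)
    out = []
    for i in range(0, n, 2):  # average each L/R pair to mono
        mono = (floats[i] + floats[i + 1]) / 2.0
        mono = max(-1.0, min(1.0, mono))  # clamp before quantizing
        out.append(int(mono * 32767))
    return struct.pack(f"<{len(out)}h", *out)
```

Keeping this conversion in one place, rather than in each client, is what makes performance uniform across product surfaces.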

Why it matters

This case study matters because it shows where production voice agents are actually fragile. The bottleneck is not only model quality. It is context management, message semantics, audio normalization, and interaction design under imperfect conditions. Those are the engineering details that turn a good voice demo into a voice system people can trust for regular use.

For developers building agent products, the broader takeaway is that voice is becoming infrastructure. Once teams operate at the scale Perplexity describes, design choices like chunk size, role labeling, and tool selection stop being implementation trivia and become product-defining architecture. That is a useful reality check for any company treating real-time multimodal agents as the next major interface layer.


Related Articles


OpenAI said on X on March 17, 2026 that GPT-5.4 mini was available in ChatGPT, Codex, and the API. The launch positions mini as a faster coding and multimodal workhorse, while OpenAI’s accompanying post also introduces GPT-5.4 nano for cheaper API-only workloads.


Google DeepMind said on March 26, 2026 that Gemini 3.1 Flash Live is rolling out in preview via the Live API in Google AI Studio. Google’s blog says the model is designed for real-time voice and vision agents, improves tool triggering in noisy environments, and supports more than 90 languages for multimodal conversations.


OpenAI Devs said on March 26, 2026 that plugins are rolling out in Codex, letting the agent work with common tools such as Slack, Figma, Notion, and Gmail. OpenAI's Codex docs describe plugins as reusable bundles that package skills, app integrations, and MCP server settings, turning Codex into a more shareable workflow layer for teams.


© 2026 Insights. All rights reserved.