xAI is turning voice agents into production software, not a demo. Grok Voice Think Fast 1.0 tops τ-voice Bench, supports 25+ languages, and xAI says the same stack is driving a 20% sales-conversion rate and a 70% support-resolution rate at Starlink.
#voice-agents
Mistral AI said on March 26, 2026 that Voxtral TTS offers expressive speech, support for 9 languages and dialects, low latency, and easy adaptation to new voices. Mistral’s March 23 launch post says the 4B-parameter model can adapt from about three seconds of reference audio, reaches roughly 70ms model latency, supports up to two minutes of native audio generation, and is available by API and as open weights.
Google AI said on March 26, 2026 that Gemini 3.1 Flash Live is launching for developers building real-time voice and vision agents. Google highlighted faster natural dialogue, better task completion in noisy environments, and stronger complex-instruction following, while its Live API docs describe low-latency multimodal streaming with tool use and 70-language support.
OpenAI Developers said on March 30, 2026 that Perplexity has been running voice experiences with the Realtime API in production and published lessons from that work. The post says Perplexity now handles millions of monthly voice sessions and details how the team changed context chunking, standardized audio formats, and tuned turn-taking for noisy real-world environments.
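The audio-format lesson generalizes beyond Perplexity: real-time pipelines typically re-chunk whatever buffer sizes a client produces into fixed-duration frames before streaming. A minimal sketch of that pattern, where the sample rate, frame duration, and zero-padding policy are illustrative assumptions rather than Perplexity's actual values:

```python
# Re-chunk variable-size PCM buffers into fixed-size frames before streaming,
# one pattern behind "standardized audio formats". All constants are assumed.

SAMPLE_RATE = 24_000   # Hz, 16-bit mono PCM (assumed format)
FRAME_MS = 20          # fixed frame duration to send upstream (assumed)
BYTES_PER_FRAME = SAMPLE_RATE * 2 * FRAME_MS // 1000  # 960 bytes

def reframe(buffers):
    """Yield fixed-size frames from an iterable of variable-size PCM buffers."""
    pending = bytearray()
    for buf in buffers:
        pending.extend(buf)
        while len(pending) >= BYTES_PER_FRAME:
            yield bytes(pending[:BYTES_PER_FRAME])
            del pending[:BYTES_PER_FRAME]
    if pending:  # flush the tail, zero-padded to a full frame
        yield bytes(pending) + b"\x00" * (BYTES_PER_FRAME - len(pending))
```

Fixed frames make downstream buffering, VAD windows, and network pacing predictable regardless of how the capture device batches audio.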
Mistral promoted Voxtral TTS on X on March 26, 2026. Mistral's release post describes a 4B-parameter multilingual TTS model with nine-language support, low time-to-first-audio, availability in Mistral Studio and API, open weights on Hugging Face under CC BY-NC 4.0, and pricing at $0.016 per 1,000 characters.
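At the listed rate, per-request cost is simple arithmetic; a quick estimator (the $0.016-per-1,000-characters price is from the post, the helper name is illustrative):

```python
# Cost estimate at Voxtral TTS's listed price of $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS = 0.016

def tts_cost(text: str) -> float:
    """Return the estimated USD cost of synthesizing `text`."""
    return len(text) / 1000 * PRICE_PER_1K_CHARS

cost = tts_cost("x" * 5000)  # a 5,000-character script ≈ $0.08
```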
Google DeepMind said on March 26, 2026 that Gemini 3.1 Flash Live is rolling out in preview via the Live API in Google AI Studio. Google’s blog says the model is designed for real-time voice and vision agents, improves tool triggering in noisy environments, and supports more than 90 languages for multimodal conversations.
LiveKit said on March 19, 2026 that it trained an audio model that can distinguish real user interruptions from backchannels and other noise. The company’s blog says the feature is now generally available in LiveKit Agents, delivers 86% precision and 100% recall at 500 ms of overlapping speech, and is enabled by default in current Python and TypeScript agent SDKs.
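For readers tracking numbers like "86% precision, 100% recall," the metrics reduce to counting true interruptions against segments flagged as interruptions. A toy sketch of the metric math only (the classifier itself is LiveKit's trained audio model and is not shown):

```python
# Precision/recall over per-segment predictions: did the model flag this
# overlap-speech segment as a real interruption, and was it actually one?

def precision_recall(preds, truth):
    """Compute (precision, recall) from parallel lists of booleans."""
    tp = sum(p and t for p, t in zip(preds, truth))          # true interrupts caught
    fp = sum(p and not t for p, t in zip(preds, truth))      # backchannels mis-flagged
    fn = sum(not p and t for p, t in zip(preds, truth))      # interrupts missed
    return tp / (tp + fp), tp / (tp + fn)
```

100% recall means no real interruption is missed; precision then measures how often backchannels ("uh-huh", "right") still cut the agent off.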
LiveKit said on X that xAI’s Grok text-to-speech is now available in LiveKit Inference with low-latency streaming, telephony readiness, and support for more than 20 languages. LiveKit’s docs say developers can access `xai/tts-1` through LiveKit Inference without a separate xAI API key or use the xAI plugin directly with `XAI_API_KEY`.
Together AI said on March 12, 2026 that it is launching a one-cloud stack for real-time voice agents. Its public materials describe co-located STT, LLM, and TTS infrastructure with under-500ms latency, 25+ regions, and separate kernel work that cut time-to-first-64-tokens to 77ms in a voice-agent deployment.
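The sub-500ms claim is easiest to read as a per-stage budget. A back-of-envelope sketch in which only the 77ms time-to-first-64-tokens figure comes from Together's materials; the other stage numbers are illustrative assumptions:

```python
# Latency budget for a co-located STT -> LLM -> TTS voice pipeline.
BUDGET_MS = 500
stages = {
    "stt_final_transcript": 150,  # assumed
    "llm_first_64_tokens": 77,    # Together's reported kernel-optimized figure
    "tts_first_audio": 120,       # assumed
    "network_overhead": 40,       # assumed; co-location keeps this small
}
total = sum(stages.values())      # 387 ms, inside the 500 ms target
assert total <= BUDGET_MS
```

The design point is that co-locating all three stages in one cloud removes cross-provider network hops, which is where much of a distributed pipeline's budget is otherwise spent.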
Running Nvidia PersonaPlex 7B in Swift on Apple Silicon moves local voice agents closer to real time
An HN post on a Swift/MLX port of Nvidia PersonaPlex 7B shows how chunking, buffering, and interrupt handling matter as much as raw model quality for local speech-to-speech agents.
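The interrupt-handling point can be sketched independently of Swift/MLX: whatever TTS audio the agent has buffered must be droppable the instant the user barges in, or the agent keeps talking over them. A minimal Python illustration (class and method names are hypothetical):

```python
from collections import deque

class PlaybackQueue:
    """Buffered TTS chunks awaiting playback, flushable on user interrupt."""

    def __init__(self):
        self._chunks = deque()
        self.interrupted = False

    def push(self, chunk: bytes):
        # Drop late-arriving TTS output once the user has interrupted.
        if not self.interrupted:
            self._chunks.append(chunk)

    def interrupt(self):
        """User barged in: discard everything not yet played."""
        self.interrupted = True
        self._chunks.clear()

    def next_chunk(self):
        return self._chunks.popleft() if self._chunks else None
```

The subtlety the post points at is that an in-flight model may still be emitting audio after the interrupt, so the queue must keep rejecting pushes rather than just clearing once.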
A highly upvoted LocalLLaMA thread highlighted KittenTTS v0.8, with community-shared details on 80M/40M/14M model variants, Apache-2.0 licensing, and an edge-friendly focus on local CPU inference.