xAI opens its Text-to-Speech API with streaming, speech tags, and five voices
Original: Grok's Text to Speech API is now available. Start building with natural voices and expressive controls to bring your apps to life. http://x.ai/api/voice#text-to-speech
What xAI announced on X
On March 16, 2026, xAI said Grok's Text to Speech API is now available and pitched it as a way to build apps with natural voices and more expressive controls. The X post was brief, but it marked a clear expansion of xAI's public API surface beyond text and reasoning into deployable audio generation.
That matters because text-to-speech is not just a demo feature. Once an API reaches production use, it becomes infrastructure for voice assistants, narration, accessibility layers, call flows, and multimodal applications that need audio output with predictable latency and format control.
What the official voice docs specify
xAI's official voice documentation describes the Text to Speech API as a beta service at POST https://api.x.ai/v1/tts. The docs say the endpoint accepts up to 4,096 characters of text, supports inline speech tags for expressive delivery, and returns output in formats ranging from standard web audio to telephony-oriented codecs.
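Because the docs cap a request at 4,096 characters, longer scripts would need to be split on the client before being sent. The following is a minimal sketch of one way to do that, assuming sentence-boundary splitting is acceptable for the narration in question. The function name split_for_tts and the splitting strategy are illustrative and not part of xAI's API.

```python
import re

MAX_TTS_CHARS = 4096  # per-request character limit cited from xAI's docs


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters.

    Hypothetical helper: prefers sentence boundaries and falls back to a
    hard cut only when a single sentence exceeds the limit.
    """
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current piece.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned piece could then be sent as its own request and the resulting audio played back in order. Whether pauses at the seams sound natural would need to be checked against real output.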
- xAI's docs list five voices: eve, ara, leo, rex, and sal.
- Supported output options include mp3, wav, pcm, mulaw, and alaw, covering browser playback, raw pipelines, and call-center style telephony use cases.
- For real-time use, xAI also documents a streaming WebSocket endpoint at wss://api.x.ai/v1/tts, where audio is returned incrementally as base64-encoded chunks.
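The constraints above can be checked on the client before any request is made. The sketch below validates the voice, output format, and text length, then reassembles base64-encoded chunks of the kind the streaming endpoint is described as returning. The payload keys ("text", "voice", "format") and both function names are assumptions made for illustration. This article does not confirm the actual request schema or the shape of the streaming messages, so consult xAI's docs for the real field names.

```python
import base64
from typing import Iterable

# Voice and format names as listed in xAI's docs, per the article.
VOICES = {"eve", "ara", "leo", "rex", "sal"}
FORMATS = {"mp3", "wav", "pcm", "mulaw", "alaw"}
MAX_CHARS = 4096


def build_tts_payload(text: str, voice: str = "eve", audio_format: str = "mp3") -> dict:
    """Validate inputs locally and return a request body.

    ASSUMPTION: the keys "text", "voice", and "format" are hypothetical
    placeholders, not the confirmed xAI request schema.
    """
    if not text or len(text) > MAX_CHARS:
        raise ValueError(f"text must be 1 to {MAX_CHARS} characters")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if audio_format not in FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    return {"text": text, "voice": voice, "format": audio_format}


def assemble_audio(chunks: Iterable[str]) -> bytes:
    """Decode base64-encoded audio chunks in arrival order and join them."""
    return b"".join(base64.b64decode(chunk) for chunk in chunks)
```

For example, build_tts_payload("Hello", "ara", "mulaw") would produce a body aimed at a telephony pipeline. Whether plain concatenation of decoded chunks yields a playable file depends on how the service frames each format, so container formats such as wav may need extra handling.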
The broader voice overview page places this TTS surface alongside xAI's interactive Voice Agent API, which suggests xAI is building a layered voice stack: one endpoint for direct speech generation and another for full conversational agents.
Why this matters
For developers, the important point is control. A usable voice API needs more than a single synthetic voice and a downloadable file. It needs low-latency streaming, format choices that match deployment environments, and expressive controls for emphasis, pacing, and tone. xAI is explicitly trying to cover those requirements from the start.
Strategically, this moves xAI closer to the broader race for multimodal developer platforms. If Grok is going to appear in customer support, media generation, enterprise workflows, or agentic products, voice output has to be first-class infrastructure. The release does not settle questions about long-term pricing or production reliability, but it does show that xAI wants its API to compete on more than text alone.
Sources: xAI X post · xAI Text to Speech docs · xAI Voice overview