xAI opens its Text-to-Speech API with streaming, speech tags, and five voices

What xAI announced on X

On March 16, 2026, xAI said Grok's Text to Speech API is now available and pitched it as a way to build apps with natural voices and more expressive controls. The X post was brief, but it marked a clear expansion of xAI's public API surface beyond text and reasoning into deployable audio generation.

That matters because text-to-speech is not just a demo feature. Once an API reaches production use, it becomes infrastructure for voice assistants, narration, accessibility layers, call flows, and multimodal applications that need audio output with predictable latency and format control.

What the official voice docs specify

xAI's official voice documentation describes the Text to Speech API as a beta service reached by sending a POST request to https://api.x.ai/v1/tts. The docs say the endpoint accepts up to 4,096 characters of text per request, supports inline speech tags for expressive delivery, and returns output in formats ranging from standard web audio to telephony-oriented codecs.
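The 4,096-character ceiling means longer scripts have to be divided before they are sent. The following is a minimal sketch of one way to do that, not anything taken from xAI's docs: the helper name split_for_tts is hypothetical, it breaks only at whitespace, it collapses runs of whitespace into single spaces, and it does not try to keep inline speech tags intact across a break.

```python
MAX_CHARS = 4096  # per-request text limit stated in xAI's docs


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters.

    Hypothetical helper: breaks at whitespace only and is not aware
    of speech tags, so a tag containing spaces could be split.
    """
    pieces: list[str] = []
    current = ""
    for word in text.split():
        if len(word) > limit:
            raise ValueError("a single word is longer than the limit")
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate      # word still fits in this piece
        else:
            pieces.append(current)   # close the piece, start a new one
            current = word
    if current:
        pieces.append(current)
    return pieces
```

Each returned piece could then be sent as its own request and the resulting audio played or concatenated in order.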

xAI's docs list five voices: eve, ara, leo, rex, and sal.
Supported output formats are mp3, wav, pcm, mulaw, and alaw, covering browser playback, raw audio pipelines, and call-center-style telephony use cases.
For real-time use, xAI also documents a streaming WebSocket endpoint at wss://api.x.ai/v1/tts, where audio is returned incrementally as base64-encoded chunks.
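To make those details concrete, here is a hedged sketch in Python that builds a request against the documented endpoint and reassembles streamed audio. Only the URL, the character limit, the voice names, and the format names come from the docs as described above. The JSON field names (text, voice, format), the Bearer authorization header, and both function names are assumptions for illustration, so check xAI's reference before relying on them. The code builds the request object without sending it.

```python
import base64
import json
import urllib.request

TTS_URL = "https://api.x.ai/v1/tts"  # endpoint given in xAI's docs
VOICES = {"eve", "ara", "leo", "rex", "sal"}
FORMATS = {"mp3", "wav", "pcm", "mulaw", "alaw"}
MAX_CHARS = 4096


def build_tts_request(text: str, voice: str, audio_format: str,
                      api_key: str) -> urllib.request.Request:
    """Validate inputs against the documented limits and build a POST request."""
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if audio_format not in FORMATS:
        raise ValueError(f"unknown format: {audio_format}")
    # ASSUMPTION: these JSON field names are illustrative, not confirmed.
    body = json.dumps(
        {"text": text, "voice": voice, "format": audio_format}
    ).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        method="POST",
        headers={
            # ASSUMPTION: Bearer-token auth for this endpoint.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def join_audio_chunks(b64_chunks: list[str]) -> bytes:
    """Decode base64-encoded audio chunks and concatenate them in order.

    ASSUMPTION: the caller has already pulled the base64 strings out of
    whatever message envelope the WebSocket stream uses.
    """
    return b"".join(base64.b64decode(chunk) for chunk in b64_chunks)
```

Passing the request to urllib.request.urlopen would send it. For the streaming endpoint, a WebSocket client would collect the base64 strings as they arrive and hand them to join_audio_chunks, or decode each one immediately for low-latency playback.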

The broader voice overview page places this TTS surface alongside xAI's interactive Voice Agent API, which suggests xAI is building a layered voice stack: one endpoint for direct speech generation and another for full conversational agents.

Why this matters

For developers, the important point is control. A usable voice API needs more than a single synthetic voice and a downloadable file. It needs low-latency streaming, format choices that match deployment environments, and expressive controls for emphasis, pacing, and tone. xAI is explicitly trying to cover those requirements from the start.

Strategically, this moves xAI closer to the broader race for multimodal developer platforms. If Grok is going to appear in customer support, media generation, enterprise workflows, or agentic products, voice output has to be first-class infrastructure. The release does not settle questions about long-term pricing or production reliability, but it does show that xAI wants its API to compete on more than text alone.

Sources: xAI X post · xAI Text to Speech docs · xAI Voice overview

