xAI opens its Text-to-Speech API with streaming, speech tags, and five voices

Original: Grok's Text to Speech API is now available. Start building with natural voices and expressive controls to bring your apps to life. http://x.ai/api/voice#text-to-speech

AI · Mar 16, 2026 · By Insights AI

What xAI announced on X

On March 16, 2026, xAI said Grok's Text to Speech API is now available and pitched it as a way to build apps with natural voices and more expressive controls. The X post was brief, but it marked a clear expansion of xAI's public API surface beyond text and reasoning into deployable audio generation.

That matters because text-to-speech is not just a demo feature. Once an API reaches production use, it becomes infrastructure for voice assistants, narration, accessibility layers, call flows, and multimodal applications that need audio output with predictable latency and format control.

What the official voice docs specify

xAI's official voice documentation describes the Text to Speech API as a beta service, reached via POST https://api.x.ai/v1/tts. The docs say the endpoint accepts up to 4,096 characters of text per request, supports inline speech tags for expressive delivery, and returns output in formats ranging from standard web audio to telephony-oriented codecs.
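The 4,096-character limit means longer scripts have to be split before they are sent. The following is a minimal sketch of one way to do that, not xAI's own method: the helper name split_for_tts and the sentence-boundary heuristic are assumptions made for illustration, and the code makes no network calls.

```python
import re

MAX_TTS_CHARS = 4096  # text limit reported for the endpoint in xAI's docs


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters, preferring sentence ends.

    Hypothetical helper for illustration; not part of any xAI SDK.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for piece in split_for_tts("One. Two. Three.", limit=10):
        print(repr(piece))
```

One caveat: the hard split for oversized sentences is unaware of inline speech tags, so a real integration would need to avoid cutting inside a tag.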

  • xAI's docs list five voices: eve, ara, leo, rex, and sal.
  • Supported output options include mp3, wav, pcm, mulaw, and alaw, covering browser playback, raw pipelines, and call-center-style telephony use cases.
  • For real-time use, xAI also documents a streaming WebSocket endpoint at wss://api.x.ai/v1/tts, where audio is returned incrementally as base64-encoded chunks.
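The constraints in this list can be checked on the client before any request is sent, and the streamed chunks can be reassembled once received. The sketch below is illustrative only: the voice names, format names, and character limit come from the docs as described above, but the request key names ("text", "voice", "format") and both function names are hypothetical placeholders, since the article does not give the actual request schema or the streaming message layout.

```python
import base64
from typing import Iterable

VOICES = {"eve", "ara", "leo", "rex", "sal"}
FORMATS = {"mp3", "wav", "pcm", "mulaw", "alaw"}
MAX_TTS_CHARS = 4096


def build_tts_request(text: str, voice: str, output_format: str) -> dict:
    """Validate inputs against the documented options and return a request body.

    The key names in the returned dict are assumptions, not xAI's schema.
    """
    if len(text) > MAX_TTS_CHARS:
        raise ValueError(f"text has {len(text)} characters; limit is {MAX_TTS_CHARS}")
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    if output_format not in FORMATS:
        raise ValueError(
            f"unknown format {output_format!r}; expected one of {sorted(FORMATS)}"
        )
    return {"text": text, "voice": voice, "format": output_format}


def join_audio_chunks(b64_chunks: Iterable[str]) -> bytes:
    """Decode base64 chunks in arrival order and concatenate the raw audio bytes."""
    return b"".join(base64.b64decode(chunk, validate=True) for chunk in b64_chunks)


if __name__ == "__main__":
    print(build_tts_request("Hello from Grok.", "eve", "mp3"))
    # Simulated stream: two base64 strings standing in for streamed messages.
    fake_stream = [base64.b64encode(b"abc").decode(), base64.b64encode(b"def").decode()]
    print(join_audio_chunks(fake_stream))
```

Keeping validation separate from transport means the same checks can sit in front of either the POST endpoint or the WebSocket endpoint. How each streamed message wraps its base64 payload would need to be confirmed against xAI's docs.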

The broader voice overview page places this TTS surface alongside xAI's interactive Voice Agent API, which suggests xAI is building a layered voice stack: one endpoint for direct speech generation and another for full conversational agents.

Why this matters

For developers, the important point is control. A usable voice API needs more than a single synthetic voice and a downloadable file. It needs low-latency streaming, format choices that match deployment environments, and expressive controls for emphasis, pacing, and tone. With a streaming endpoint, five output formats, and inline speech tags documented at launch, xAI is explicitly trying to cover those requirements from the start.

Strategically, this moves xAI closer to the broader race for multimodal developer platforms. If Grok is going to appear in customer support, media generation, enterprise workflows, or agentic products, voice output has to be first-class infrastructure. The release does not settle questions about long-term pricing or production reliability, but it does show that xAI wants its API to compete on more than text alone.

Sources: xAI X post · xAI Text to Speech docs · xAI Voice overview


© 2026 Insights. All rights reserved.