Gemini 3.1 Flash TTS adds audio tags and 70+ languages
Original: Gemini 3.1 Flash TTS: the next generation of expressive AI speech
Gemini 3.1 Flash TTS matters because speech models are no longer judged only by whether they sound clean. The harder question is control: can a developer ask a voice to slow down, shift tone, switch speakers, or keep a character consistent without building a separate production stack? In an April 15 post, Google said the new model brings audio tags to text-to-speech, letting instructions inside the input steer vocal style, pace, and delivery.
The rollout is broader than a research demo. Google says 3.1 Flash TTS is available in preview for developers through the Gemini API and Google AI Studio, in preview for enterprises on Vertex AI, and for Workspace users through Google Vids. That puts the same model across prototyping, enterprise deployment, and video creation workflows, which is exactly where voice agents and localized media production are beginning to overlap.
The hard numbers are the hook. Gemini 3.1 Flash TTS supports 70+ languages and posted an Elo score of 1,211 on the Artificial Analysis TTS leaderboard, which is built from blind human preferences. Google also points to native multi-speaker dialogue, Audio Profiles, Director's Notes, and inline tags as tools for directing speech output. In practical terms, the model is trying to turn a prompt into something closer to a voice performance brief.
The safety detail is not a footnote. Google says all audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID, its imperceptible marker for identifying AI-generated media. The next test is whether those controls stay reliable outside polished demos: noisy scripts, long-form narration, multiple speakers, and languages with less commercial training data. Source: Google Keyword.
For developers, the important boundary is consistency. A short demo voice is easy to impress with; a product voice has to keep the same persona across retries, speaker changes, and localization passes. By putting Audio Profiles and inline instructions near the prompt, Google is trying to make that control inspectable instead of hiding it in a separate studio layer.
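Google's post does not include code, but the control model it describes (inline tags plus a multi-speaker script near the prompt) can be sketched. The tag names (`calm`, `excited`) and the model id `gemini-3.1-flash-tts` below are hypothetical; the commented-out request shape follows the existing google-genai Python SDK interface for Gemini 2.5 TTS models, which this release would presumably extend.

```python
# Sketch: steering delivery with inline audio tags in a multi-speaker script.
# Assumptions (not confirmed by Google's post): the tag names and the model
# id "gemini-3.1-flash-tts" are hypothetical placeholders.

def build_tagged_prompt(lines):
    """Join (speaker, tag, text) tuples into one tagged dialogue script."""
    return "\n".join(
        f"{speaker}: [{tag}] {text}" if tag else f"{speaker}: {text}"
        for speaker, tag, text in lines
    )

script = build_tagged_prompt([
    ("Narrator", "calm", "Welcome back to the show."),
    ("Guest", "excited", "Thanks, it's great to be here!"),
    ("Narrator", None, "Let's get started."),
])

# The actual request needs an API key; the shape below mirrors the current
# google-genai TTS interface for Gemini 2.5 models:
#
#   from google import genai
#   from google.genai import types
#   client = genai.Client()
#   resp = client.models.generate_content(
#       model="gemini-3.1-flash-tts",   # hypothetical model id
#       contents=script,
#       config=types.GenerateContentConfig(response_modalities=["AUDIO"]),
#   )

print(script)
```

The point of keeping tags inline rather than in a separate studio layer is exactly the inspectability the paragraph above describes: the performance brief travels with the prompt, so it survives retries and localization passes unchanged.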
Related Articles
A Hacker News thread surfaced a GitHub repo claiming it can detect and weaken Gemini image SynthID watermarks using signal processing alone. The more important debate was not the headline claim itself, but whether the project had been validated against Google's own detector and what that says about watermark-based provenance overall.
Google on April 8 began rolling out Gemini for Home early access in Japan. The update moves Google Home from fixed commands toward conversational control, AI camera summaries, and natural-language video search.
Google said on March 27, 2026 that Google Translate's Live translate with headphones is now available on iOS and expanding to more countries on both Android and iOS. Google's official product pages say the feature supports 70+ languages, works with any pair of headphones, and builds on Gemini speech-to-speech translation designed to preserve tone, emphasis, and cadence.