
Fish Audio S2 Brings Inline Emotion Control and Fast Streaming to Open TTS

Original: Fish Audio Releases S2: open-source, controllable and expressive TTS model

AI · Mar 15, 2026 · By Insights AI (Reddit) · 2 min read

Why this release stood out on LocalLLaMA

A March 9, 2026 post in r/LocalLLaMA pointed to Fish Audio’s S2 announcement and its model card. The attention makes sense. The project is not just another text-to-speech checkpoint. It combines open-weight release materials, fine-tuning code, and a production-oriented inference story around SGLang streaming.

The headline feature is fine-grained inline control. Instead of relying on a small fixed emotion taxonomy, S2 accepts free-form natural-language tags inside the text, such as whispering, laughter, pitch changes, or broadcast-style delivery. Fish Audio says that lets users control prosody and expression locally at the word or phrase level, which is a different interface from conventional speaker conditioning or coarse style tokens.
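To make the interface concrete, here is a minimal sketch of what word- and phrase-level inline tagging could look like. The parenthesized tag syntax and the helper function are illustrative assumptions, not Fish Audio's published markup:

```python
# Hypothetical sketch of S2-style inline control tags. The tag syntax
# here is an assumption for illustration, not Fish Audio's spec.

def tag(text: str, style: str) -> str:
    """Wrap a phrase in a free-form natural-language control tag."""
    return f"({style}) {text}"

line = " ".join([
    tag("Did you hear that?", "whispering"),
    tag("I can't believe it!", "laughing, rising pitch"),
    "Back to you in the studio.",  # untagged text keeps the base delivery
])
print(line)
```

The point of the design is visible even in this toy: control lives inside the text stream, scoped to the phrase it wraps, rather than as a single global style token for the whole utterance.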

What the architecture and metrics suggest

According to Fish Audio’s documentation, S2 Pro uses a dual-autoregressive design on top of an RVQ-based audio codec. A 4B “slow AR” path models the main time-axis structure, while a 400M “fast AR” path reconstructs residual acoustic detail. The company argues that this structure preserves fidelity without paying the full sequence-length penalty of flattening every codec stream into one giant autoregressive pass.
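The sequence-length argument can be made concrete with back-of-the-envelope arithmetic. The frame rate and codebook count below are illustrative assumptions, not published S2 parameters:

```python
# Illustrative arithmetic only: frame rate and codebook count are assumed.
frame_rate_hz = 50     # codec frames per second of audio (assumption)
num_codebooks = 8      # RVQ residual levels per frame (assumption)
audio_seconds = 10

# Flattened AR: every codebook token joins one long autoregressive pass.
flattened_len = frame_rate_hz * num_codebooks * audio_seconds  # 4000 steps

# Dual-AR: the slow AR walks the time axis once; the fast AR expands
# each frame's residual levels in a short local pass.
slow_len = frame_rate_hz * audio_seconds       # 500 time-axis steps
fast_len_per_frame = num_codebooks             # 8 local steps per frame

print(flattened_len, slow_len, fast_len_per_frame)
```

Under these assumed numbers, the global autoregressive sequence shrinks by the number of codebooks, which is the cost the dual-path design is built to avoid paying.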

The reported numbers are strong, though they are still vendor-reported. Fish Audio says S2 was trained on more than 10 million hours of audio data, supports 80+ languages in the model card, reaches an Audio Turing Test posterior mean of 0.515, and posts an 81.88% win rate on EmergentTTS-Eval against a cited gpt-4o-mini-tts baseline. On serving, the blog claims roughly 100 ms time-to-first-audio and 3,000+ acoustic tokens per second on an NVIDIA H200 with a real-time factor of 0.195.
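The serving claims can be sanity-checked against each other. A real-time factor below 1.0 means audio is synthesized faster than it plays back; the arithmetic below simply unpacks the reported figures:

```python
# Unpack the vendor-reported serving numbers (H200 figures from the blog).
rtf = 0.195       # reported real-time factor
ttfa_ms = 100     # reported time-to-first-audio, in milliseconds

audio_seconds = 60
generation_seconds = rtf * audio_seconds   # time to synthesize one minute
realtime_speedup = 1 / rtf                 # how far ahead of playback

print(f"{generation_seconds:.1f}s to generate {audio_seconds}s of audio")
print(f"~{realtime_speedup:.1f}x real time; first audio after ~{ttfa_ms} ms")
```

In other words, if the numbers hold, a minute of speech takes under twelve seconds to produce, and streaming can begin roughly a tenth of a second after the request arrives.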

Why the release matters beyond one model

The more strategic point is that Fish Audio is framing TTS like modern LLM infrastructure. Because the Dual-AR design stays close to autoregressive language-model serving patterns, the system can reuse batching, KV-cache, CUDA graph, and prefix-caching optimizations from the LLM stack. That narrows the gap between "research model" and "production voice service."
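Prefix caching is the easiest of these optimizations to illustrate. The toy below is not S2's implementation; it only shows the serving idea that transfers from LLMs: a repeated prompt prefix (say, a fixed speaker-and-style preamble) reuses cached per-token state instead of being re-encoded on every request.

```python
# Toy prefix cache mapping token prefixes to (mock) per-token KV state.
# Illustrates the serving concept only; not Fish Audio's implementation.
cache: dict[tuple[str, ...], list[str]] = {}

def encode(tokens: list[str]) -> list[str]:
    """Return per-token 'KV state', reusing the longest cached prefix."""
    states: list[str] = []
    for cut in range(len(tokens), 0, -1):      # longest cached prefix wins
        hit = cache.get(tuple(tokens[:cut]))
        if hit is not None:
            states = list(hit)
            break
    for tok in tokens[len(states):]:           # encode only the new suffix
        states.append(f"kv({tok})")
        cache[tuple(tokens[:len(states)])] = list(states)
    return states

preamble = ["<speaker:news>", "<style:broadcast>"]
encode(preamble + ["Good", "evening"])
states = encode(preamble + ["Top", "story"])   # preamble served from cache
print(len(states))
```

A production system does this with attention KV tensors rather than strings, but the payoff is the same: per-request work scales with the novel suffix, not the full prompt.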

There is one major caveat: the release is not under a permissive software-style license. The model card lists the Fish Audio Research License, with research and non-commercial use allowed and separate commercial licensing required. Even with that limitation, S2 is a notable open-model milestone because it joins controllability, streaming, and multilingual coverage in one package.

Primary sources: Fish Audio blog, model card. Community discussion: r/LocalLLaMA.


Related Articles

AI · X/Twitter · Apr 5, 2026 · 2 min read

Mistral AI said on March 26, 2026 that Voxtral TTS offers expressive speech, support for 9 languages and dialects, low latency, and easy adaptation to new voices. Mistral’s March 23 launch post says the 4B-parameter model can adapt from about three seconds of reference audio, reaches roughly 70ms model latency, supports up to two minutes of native audio generation, and is available by API and as open weights.
