
Fish Audio S2 Brings Inline Emotion Control and Fast Streaming to Open TTS

Original: Fish Audio Releases S2: open-source, controllable and expressive TTS model

AI · Mar 15, 2026 · By Insights AI (Reddit) · 2 min read

Why this release stood out on LocalLLaMA

A March 9, 2026 post in r/LocalLLaMA pointed to Fish Audio’s S2 announcement and its model card. The attention makes sense. The project is not just another text-to-speech checkpoint. It combines open-weight release materials, fine-tuning code, and a production-oriented inference story around SGLang streaming.

The headline feature is fine-grained inline control. Instead of relying on a small fixed emotion taxonomy, S2 accepts free-form natural-language tags inside the text, such as whispering, laughter, pitch changes, or broadcast-style delivery. Fish Audio says that lets users control prosody and expression locally at the word or phrase level, which is a different interface from conventional speaker conditioning or coarse style tokens.
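To make the interface concrete, here is a minimal sketch of what word- and phrase-level inline tagging could look like. The parenthesized tag syntax and the helper function are illustrative assumptions, not Fish Audio's published markup:

```python
# Hypothetical sketch of S2-style inline control tags. The tag syntax
# here is an assumption for illustration, not Fish Audio's spec.

def tag(text: str, style: str) -> str:
    """Wrap a phrase in a free-form natural-language control tag."""
    return f"({style}) {text}"

line = " ".join([
    tag("Did you hear that?", "whispering"),
    tag("I can't believe it!", "laughing, rising pitch"),
    "Back to you in the studio.",  # untagged text keeps the base delivery
])
print(line)
```

The point of the design is visible even in this toy: control lives inside the text stream, scoped to the phrase it wraps, rather than as a single global style token for the whole utterance.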

What the architecture and metrics suggest

According to Fish Audio’s documentation, S2 Pro uses a dual-autoregressive design on top of an RVQ-based audio codec. A 4B “slow AR” path models the main time-axis structure, while a 400M “fast AR” path reconstructs residual acoustic detail. The company argues that this structure preserves fidelity without paying the full sequence-length penalty of flattening every codec stream into one giant autoregressive pass.
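The sequence-length argument can be made concrete with back-of-the-envelope arithmetic. The frame rate and codebook count below are illustrative assumptions, not published S2 parameters:

```python
# Illustrative arithmetic only: frame rate and codebook count are assumed.
frame_rate_hz = 50     # codec frames per second of audio (assumption)
num_codebooks = 8      # RVQ residual levels per frame (assumption)
audio_seconds = 10

# Flattened AR: every codebook token joins one long autoregressive pass.
flattened_len = frame_rate_hz * num_codebooks * audio_seconds  # 4000 steps

# Dual-AR: the slow AR walks the time axis once; the fast AR expands
# each frame's residual levels in a short local pass.
slow_len = frame_rate_hz * audio_seconds       # 500 time-axis steps
fast_len_per_frame = num_codebooks             # 8 local steps per frame

print(flattened_len, slow_len, fast_len_per_frame)
```

Under these assumed numbers, the global autoregressive sequence shrinks by the number of codebooks, which is the cost the dual-path design is built to avoid paying.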

The reported numbers are strong, though they are still vendor-reported. Fish Audio says S2 was trained on more than 10 million hours of audio data, supports 80+ languages in the model card, reaches an Audio Turing Test posterior mean of 0.515, and posts an 81.88% win rate on EmergentTTS-Eval against a cited gpt-4o-mini-tts baseline. On serving, the blog claims roughly 100 ms time-to-first-audio and 3,000+ acoustic tokens per second on an NVIDIA H200 with a real-time factor of 0.195.
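The serving claims can be sanity-checked against each other. A real-time factor below 1.0 means audio is synthesized faster than it plays back; the arithmetic below simply unpacks the reported figures:

```python
# Unpack the vendor-reported serving numbers (H200 figures from the blog).
rtf = 0.195       # reported real-time factor
ttfa_ms = 100     # reported time-to-first-audio, in milliseconds

audio_seconds = 60
generation_seconds = rtf * audio_seconds   # time to synthesize one minute
realtime_speedup = 1 / rtf                 # how far ahead of playback

print(f"{generation_seconds:.1f}s to generate {audio_seconds}s of audio")
print(f"~{realtime_speedup:.1f}x real time; first audio after ~{ttfa_ms} ms")
```

In other words, if the numbers hold, a minute of speech takes under twelve seconds to produce, and streaming can begin roughly a tenth of a second after the request arrives.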

Why the release matters beyond one model

The more strategic point is that Fish Audio is framing TTS like modern LLM infrastructure. Because the Dual-AR design stays close to autoregressive language-model serving patterns, the system can reuse batching, KV-cache, CUDA graph, and prefix-caching optimizations from the LLM stack. That narrows the gap between "research model" and "production voice service."
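Prefix caching is the easiest of these optimizations to illustrate. The toy below is not S2's implementation; it only shows the serving idea that transfers from LLMs: a repeated prompt prefix (say, a fixed speaker-and-style preamble) reuses cached per-token state instead of being re-encoded on every request.

```python
# Toy prefix cache mapping token prefixes to (mock) per-token KV state.
# Illustrates the serving concept only; not Fish Audio's implementation.
cache: dict[tuple[str, ...], list[str]] = {}

def encode(tokens: list[str]) -> list[str]:
    """Return per-token 'KV state', reusing the longest cached prefix."""
    states: list[str] = []
    for cut in range(len(tokens), 0, -1):      # longest cached prefix wins
        hit = cache.get(tuple(tokens[:cut]))
        if hit is not None:
            states = list(hit)
            break
    for tok in tokens[len(states):]:           # encode only the new suffix
        states.append(f"kv({tok})")
        cache[tuple(tokens[:len(states)])] = list(states)
    return states

preamble = ["<speaker:news>", "<style:broadcast>"]
encode(preamble + ["Good", "evening"])
states = encode(preamble + ["Top", "story"])   # preamble served from cache
print(len(states))
```

A production system does this with attention KV tensors rather than strings, but the payoff is the same: per-request work scales with the novel suffix, not the full prompt.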

There is one major caveat: the release is not under a permissive software-style license. The model card lists the Fish Audio Research License, with research and non-commercial use allowed and separate commercial licensing required. Even with that limitation, S2 is a notable open-model milestone because it joins controllability, streaming, and multilingual coverage in one package.

Primary sources: Fish Audio blog, model card. Community discussion: r/LocalLLaMA.


Related Articles

AI · X/Twitter · Apr 5, 2026 · 2 min read

Mistral AI said on March 26, 2026 that Voxtral TTS offers expressive speech, support for 9 languages and dialects, low latency, and easy adaptation to new voices. Mistral’s March 23 launch post says the 4B-parameter model can adapt from about three seconds of reference audio, reaches roughly 70ms model latency, supports up to two minutes of native audio generation, and is available by API and as open weights.
