Fish Audio S2 Brings Inline Emotion Control and Fast Streaming to Open TTS

Original: Fish Audio Releases S2: open-source, controllable and expressive TTS model

AI · Mar 15, 2026 · By Insights AI (Reddit) · 2 min read

Why this release stood out on LocalLLaMA

A March 9, 2026 post in r/LocalLLaMA pointed to Fish Audio’s S2 announcement and its model card. The attention makes sense. The project is not just another text-to-speech checkpoint. It combines open-weight release materials, fine-tuning code, and a production-oriented inference story around SGLang streaming.

The headline feature is fine-grained inline control. Instead of relying on a small fixed emotion taxonomy, S2 accepts free-form natural-language tags inside the text, such as whispering, laughter, pitch changes, or broadcast-style delivery. Fish Audio says that lets users control prosody and expression locally at the word or phrase level, which is a different interface from conventional speaker conditioning or coarse style tokens.
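To make the interface difference concrete, here is a purely illustrative sketch of what inline free-form tags might look like mixed into request text. The bracket syntax and tag names are assumptions for illustration; the post only says S2 accepts natural-language tags inside the text, not what the exact format is.

```python
import re

# Hypothetical inline-tag syntax (illustrative only; not Fish Audio's
# documented format). Free-form descriptions sit next to the words
# they govern, instead of one global style token for the whole clip.
text = (
    "[whispering] I have a secret. "
    "[laughing] You won't believe it! "
    "[broadcast voice] And now, the evening news."
)

# Separate the control tags from the spoken content so we can see
# which local spans each tag would affect.
tags = re.findall(r"\[([^\]]+)\]", text)
spoken = re.sub(r"\[[^\]]+\]\s*", "", text)

print(tags)    # tags in order of appearance
print(spoken)  # text with tags stripped
```

The point of the interface is locality: each tag scopes to the phrase that follows it, rather than conditioning the entire utterance on one speaker or style embedding.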

What the architecture and metrics suggest

According to Fish Audio’s documentation, S2 Pro uses a dual-autoregressive design on top of an RVQ-based audio codec. A 4B “slow AR” path models the main time-axis structure, while a 400M “fast AR” path reconstructs residual acoustic detail. The company argues that this structure preserves fidelity without paying the full sequence-length penalty of flattening every codec stream into one giant autoregressive pass.
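A toy decode loop can illustrate the shape of that argument. Everything below is a stand-in: the stub functions, codebook size, and residual-level count are invented for illustration, and the real 4B/400M networks are neural models, not lookup stubs. The sketch only shows why the fast path keeps residual codebooks off the slow path's time axis.

```python
import random

# Invented sizes for illustration only.
NUM_RESIDUAL_LEVELS = 7   # assumption: residual RVQ codebooks per frame
CODEBOOK_SIZE = 1024      # assumption

def slow_ar(history):
    """Stand-in for the 4B slow AR: one coarse token per time step."""
    random.seed(len(history))  # deterministic stub, not a real model
    return random.randrange(CODEBOOK_SIZE)

def fast_ar(coarse_token, level):
    """Stand-in for the 400M fast AR: residual detail for one level."""
    return (coarse_token * 31 + level) % CODEBOOK_SIZE

def decode(num_frames):
    frames, coarse_history = [], []
    for _ in range(num_frames):
        # The slow path advances along the time axis once per frame...
        coarse = slow_ar(coarse_history)
        coarse_history.append(coarse)
        # ...while the fast path fills in that frame's residual
        # codebooks, so the time-axis sequence never grows by a factor
        # of (1 + NUM_RESIDUAL_LEVELS) as a flattened pass would.
        residuals = [fast_ar(coarse, lvl) for lvl in range(NUM_RESIDUAL_LEVELS)]
        frames.append([coarse] + residuals)
    return frames

frames = decode(5)
# Each frame carries 1 coarse token plus NUM_RESIDUAL_LEVELS residuals.
```

In a flattened design, the slow model would have to emit all eight streams in one sequence, multiplying its context length; here the slow path's sequence length stays equal to the number of frames.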

The reported numbers are strong, though they are still vendor-reported. Fish Audio says S2 was trained on more than 10 million hours of audio data, supports 80+ languages in the model card, reaches an Audio Turing Test posterior mean of 0.515, and posts an 81.88% win rate on EmergentTTS-Eval against a cited gpt-4o-mini-tts baseline. On serving, the blog claims roughly 100 ms time-to-first-audio and 3,000+ acoustic tokens per second on an NVIDIA H200 with a real-time factor of 0.195.
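It is worth unpacking what the reported real-time factor implies. RTF is compute time divided by audio duration, so an RTF of 0.195 means roughly 0.195 seconds of wall-clock time per second of generated audio on the cited hardware:

```python
# Arithmetic on the vendor-reported serving numbers.
# RTF = synthesis time / audio duration, so RTF < 1 is faster
# than real time.
rtf = 0.195
clip_seconds = 30.0

synthesis_time = rtf * clip_seconds
print(f"{clip_seconds:.0f}s clip in ~{synthesis_time:.2f}s")  # ~5.85s

# Equivalently, throughput is 1/RTF seconds of audio per second
# of wall-clock time.
speedup = 1 / rtf
print(f"~{speedup:.1f}x faster than real time")  # ~5.1x
```

Combined with the claimed ~100 ms time-to-first-audio, that profile is what makes the streaming story plausible for interactive use, assuming the numbers hold outside the vendor's own benchmarks.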

Why the release matters beyond one model

The more strategic point is that Fish Audio is framing TTS like modern LLM infrastructure. Because the dual-AR design stays close to autoregressive language-model serving patterns, the system can reuse batching, KV-cache, CUDA graph, and prefix-caching optimizations from the LLM stack. That narrows the gap between "research model" and "production voice service."

There is one major caveat: the release is not under a permissive software-style license. The model card lists the Fish Audio Research License, with research and non-commercial use allowed and separate commercial licensing required. Even with that limitation, S2 is a notable open-model milestone because it joins controllability, streaming, and multilingual coverage in one package.

Primary sources: Fish Audio blog, model card. Community discussion: r/LocalLLaMA.


© 2026 Insights. All rights reserved.