A high-signal LocalLLaMA thread formed around Voxtral TTS because Mistral paired low latency, multilingual support, and open weights in a part of the stack many teams still keep closed.
#speech
LiveKit said on March 19, 2026 that it trained an audio model that can distinguish real user interruptions from backchannels and other noise. The company's blog says the feature is now generally available in LiveKit Agents, delivers 86% precision and 100% recall at 500 ms of overlapping speech, and is enabled by default in the current Python and TypeScript agent SDKs.
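Those figures map onto a standard confusion-matrix trade-off: 100% recall means no real interruption is missed, while 86% precision means some backchannels get misflagged as interruptions. A minimal sketch of how the two numbers relate (the labeled events below are made up for illustration, not LiveKit's evaluation data):

```python
def precision_recall(pairs):
    """Compute precision and recall from (predicted, actual) boolean pairs,
    where True means 'real interruption' and False means 'backchannel/noise'."""
    tp = sum(1 for pred, actual in pairs if pred and actual)
    fp = sum(1 for pred, actual in pairs if pred and not actual)
    fn = sum(1 for pred, actual in pairs if not pred and actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative outcomes: every real interruption is caught (no false
# negatives -> 100% recall), but a few backchannels are misflagged
# (false positives pull precision down to 86%).
events = [(True, True)] * 43 + [(True, False)] * 7 + [(False, False)] * 50
p, r = precision_recall(events)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=86% recall=100%
```

For a voice agent, this is usually the right side of the trade-off to err on: a missed interruption (low recall) leaves the agent talking over the user, while an occasional false positive just pauses the agent briefly.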
Kitten TTS v0.8 drew Hacker News attention by promising ONNX-based speech synthesis with 15M- to 80M-parameter models that run locally on CPUs, while commenters stress-tested real-world usability.
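As a rough feasibility check (my arithmetic, not Kitten TTS's published figures), those parameter counts imply weight footprints small enough for almost any CPU-only machine:

```python
def model_size_mb(params: int, bytes_per_param: int) -> float:
    """Approximate in-memory weight size; ignores activations and runtime overhead."""
    return params * bytes_per_param / (1024 ** 2)

for params in (15_000_000, 80_000_000):
    for label, nbytes in (("fp32", 4), ("fp16", 2), ("int8", 1)):
        size = model_size_mb(params, nbytes)
        print(f"{params / 1e6:.0f}M params @ {label}: ~{size:.0f} MB")
```

Even the 80M model at fp32 is around 300 MB of weights, which is why local CPU inference is plausible here; the commenter stress tests would then hinge on synthesis quality and real-time factor rather than memory.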
Mistral has published Voxtral Realtime and Voxtral Mini Transcribe V2, adding sub-200ms streaming transcription, 13-language support, and open weights for the realtime model. The company also paired the launch with an audio playground in Mistral Studio and aggressive API pricing at $0.003/min and $0.006/min.
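At per-minute rates, transcription cost scales linearly with audio volume, so budgeting is simple arithmetic. A quick back-of-the-envelope using the posted rates (the usage figures below are hypothetical):

```python
def monthly_cost(minutes_per_day: float, days: int, rate_per_min: float) -> float:
    """Linear cost model: total audio minutes times the per-minute API rate."""
    return minutes_per_day * days * rate_per_min

# Example workload: 8 hours of audio per day, 30 days a month.
for rate in (0.003, 0.006):
    print(f"${rate}/min -> ${monthly_cost(8 * 60, 30, rate):.2f}/month")
```

At these rates, even a full workday of continuous audio lands in the tens of dollars per month, which is the kind of pricing pressure the "aggressive" label refers to.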
A March 9, 2026 LocalLLaMA discussion highlighted Fish Audio’s S2 release, which combines fine-grained inline speech control, multilingual coverage, and an SGLang-based streaming stack.
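"Fine-grained inline speech control" generally means style markers embedded directly in the input text, so one request can switch delivery mid-utterance. A hypothetical parser for such markup (the `(tag)` syntax here is illustrative only, not Fish Audio's actual format):

```python
import re

TAG = re.compile(r"\((\w+)\)")

def parse_inline_controls(text: str, default: str = "neutral"):
    """Split text into (style, segment) pairs based on inline (tag) markers."""
    segments = []
    style = default
    pos = 0
    for m in TAG.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append((style, chunk))
        style = m.group(1)  # the tag sets the style for everything that follows
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((style, tail))
    return segments

print(parse_inline_controls(
    "Hello there. (whisper) Don't tell anyone. (excited) We shipped!"
))
```

A streaming TTS stack would consume these (style, text) pairs one at a time, which is also what makes low-latency synthesis with mid-sentence style changes possible.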
IBM unveiled Granite 4.0 1B Speech on March 9, 2026 as a compact multilingual speech-language model for ASR and bidirectional speech translation. The company says it improves English transcription accuracy over its predecessor while cutting model size in half and adding Japanese support.
Developer Nick Tikhonov shares how he built a voice AI agent achieving ~400ms end-to-end latency with a full STT → LLM → TTS pipeline, including clean barge-ins and no precomputed responses.