A 54-point Reddit post flagged the merge of PR #19441 as the moment qwen3-omni-moe and qwen3-asr support landed in llama.cpp, with commenters focused on local multimodal and ASR use cases.
#audio
Mistral AI said on March 26, 2026 that Voxtral TTS offers expressive speech, support for 9 languages and dialects, low latency, and easy adaptation to new voices. Mistral's March 23 launch post says the 4B-parameter model can adapt to a new voice from about three seconds of reference audio, reaches roughly 70ms model latency, supports up to two minutes of native audio generation, and is available via API and as open weights.
Mistral said on April 2, 2026 that developers can assemble a web-search-enabled speech-to-speech assistant in roughly 150 lines of code using Voxtral for transcription and speech generation plus Mistral Small 4 for agentic reasoning. The post is notable less as a single model launch than as a clear reference architecture for real-time audio agents.
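The architecture Mistral describes is a three-stage loop: transcribe incoming audio, reason over the text (optionally with web search), then synthesize the reply. A minimal sketch of that loop, with hypothetical stand-in functions rather than Mistral's actual API calls:

```python
# Hedged sketch of the transcribe -> reason -> synthesize loop described above.
# The stage callables are hypothetical placeholders; in the real assistant they
# would wrap Voxtral STT/TTS and Mistral Small 4 API calls.
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceAgent:
    transcribe: Callable[[bytes], str]   # Voxtral speech-to-text stand-in
    reason: Callable[[str], str]         # Mistral Small 4 stand-in (agentic step)
    synthesize: Callable[[str], bytes]   # Voxtral text-to-speech stand-in

    def turn(self, audio_in: bytes) -> bytes:
        """Run one conversational turn: audio in, audio out."""
        text = self.transcribe(audio_in)
        reply = self.reason(text)
        return self.synthesize(reply)

# Stub stages so the skeleton runs without network access.
agent = VoiceAgent(
    transcribe=lambda audio: audio.decode(),
    reason=lambda text: f"echo: {text}",
    synthesize=lambda text: text.encode(),
)
print(agent.turn(b"what is the weather?"))
```

Keeping each stage behind a plain callable is what makes the real version compact: swapping the stubs for API clients changes the wiring, not the loop.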
xAI said on March 16, 2026 that Grok's Text-to-Speech API is now available. xAI's own voice docs describe a beta API with five voices, inline speech tags, telephony-friendly codecs, and a streaming WebSocket mode for low-latency applications.
Mistral has published Voxtral Realtime and Voxtral Mini Transcribe V2, adding sub-200ms streaming transcription, support for 13 languages, and open weights for the realtime model. The company paired the launch with an audio playground in Mistral Studio and aggressive API pricing at $0.003/min and $0.006/min.
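To put the per-minute pricing in perspective, the arithmetic is straightforward; a quick sketch using the two quoted rates:

```python
# Cost arithmetic for the per-minute API pricing quoted above.
def cost_usd(minutes: float, rate_per_min: float) -> float:
    """Total cost in USD for a given number of audio minutes at a per-minute rate."""
    return round(minutes * rate_per_min, 4)

# One hour of audio at each quoted rate:
print(cost_usd(60, 0.003))    # 0.18
print(cost_usd(60, 0.006))    # 0.36
# A full day (1440 minutes) of continuous transcription at the lower rate:
print(cost_usd(1440, 0.003))  # 4.32
```

At these rates, even round-the-clock transcription stays in single-digit dollars per day, which is the "aggressive" part of the pricing.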
A March 9, 2026 LocalLLaMA discussion highlighted Fish Audio’s S2 release, which combines fine-grained inline speech control, multilingual coverage, and an SGLang-based streaming stack.