Mistral expands its speech stack with Voxtral Realtime and Voxtral Mini Transcribe V2
Original: Voxtral transcribes at the speed of sound.
Mistral has expanded its speech product line with two closely linked releases: Voxtral Realtime for low-latency streaming transcription and Voxtral Mini Transcribe V2 for high-efficiency batch transcription. Together, the two releases give Mistral a more complete voice stack spanning live voice agents, subtitle generation, meeting transcription, and post-call processing. The company also introduced a new audio playground in Mistral Studio, signaling that this is intended as a developer platform update rather than just a research drop.
Voxtral Realtime is the more strategically interesting of the two models because it targets applications where delay directly shapes user experience. Mistral says the model is purpose-built for streaming audio rather than an offline model adapted to chunked input, and that latency can be configured down to sub-200 ms. At 2.4 seconds of delay, the company says Realtime matches the accuracy of Voxtral Mini Transcribe V2, and at 480 ms it stays within 1-2% word error rate of that mark. Mistral also says the model supports 13 languages, runs with a 4B parameter footprint, and ships under Apache 2.0 on Hugging Face, which makes it unusually usable for privacy-sensitive and edge deployments.
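To make the latency knob concrete: a streaming client typically slices captured audio into fixed-duration chunks sized to the latency budget before sending them over the wire. The sketch below is a generic client-side illustration of that tradeoff, not Mistral's actual streaming protocol or SDK; the sample rate and chunking scheme are assumptions.

```python
def chunk_pcm(samples: bytes, sample_rate: int, delay_s: float,
              bytes_per_sample: int = 2) -> list[bytes]:
    """Slice a mono PCM byte stream into fixed-duration chunks for a
    streaming transcriber. delay_s is the configured latency budget
    (the article cites settings from sub-200 ms up to 2.4 s); smaller
    chunks mean faster partial results but less context per request."""
    chunk_bytes = int(sample_rate * delay_s) * bytes_per_sample
    return [samples[i:i + chunk_bytes]
            for i in range(0, len(samples), chunk_bytes)]

# One second of silent 16 kHz, 16-bit audio split at a 480 ms setting
# yields two full chunks plus a short tail.
audio = bytes(16000 * 2)
chunks = chunk_pcm(audio, sample_rate=16000, delay_s=0.48)
```

Dropping `delay_s` toward 0.2 shrinks each chunk proportionally, which is the tradeoff the 480 ms vs. 2.4 s accuracy figures describe.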
Voxtral Mini Transcribe V2 is positioned as the price-performance workhorse. Mistral reports roughly 4% word error rate on the FLEURS benchmark at $0.003/min and claims it outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova on accuracy. The company also says it processes audio about 3x faster than ElevenLabs Scribe v2 while matching quality at one-fifth the cost. Feature-wise, the batch model includes speaker diarization, context biasing for up to 100 words or phrases, word-level timestamps, 13-language support, noise robustness, and up to 3 hours of audio in a single request. Those are the practical capabilities developers need for production meeting notes, call analytics, and multimedia indexing.
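The batch features listed above map naturally onto per-request options. The sketch below assembles an illustrative request payload; the endpoint path, model id, and field names are assumptions inferred from the feature list, not confirmed API parameters.

```python
def build_transcribe_fields(bias_terms: list[str]) -> dict:
    """Assemble form fields for a hypothetical batch transcription call.
    All field names and the model id below are illustrative guesses."""
    return {
        "model": "voxtral-mini-transcribe-v2",   # hypothetical model id
        "diarize": "true",                       # speaker diarization
        "timestamp_granularity": "word",         # word-level timestamps
        # The article says up to 100 context-bias terms; truncate defensively.
        "context_bias": ",".join(bias_terms[:100]),
    }

fields = build_transcribe_fields(["Voxtral", "FLEURS", "diarization"])

# Sending the request would look roughly like this (requires the
# third-party `requests` package, an API key, and a real endpoint):
# import os, requests
# resp = requests.post(
#     "https://api.mistral.ai/v1/audio/transcriptions",  # path is an assumption
#     headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
#     data=fields,
#     files={"file": open("meeting.m4a", "rb")},  # up to 3 hours of audio
# )
```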
Mistral packaged the release with tooling that lowers experimentation cost. The audio playground in Mistral Studio supports up to 10 uploaded files, diarization toggles, timestamp granularity controls, and context bias terms, with support for .mp3, .wav, .m4a, .flac, and .ogg files up to 1GB each. API pricing is set at $0.006/min for Voxtral Realtime and $0.003/min for Voxtral Mini Transcribe V2. Mistral also says both models can be deployed in GDPR- and HIPAA-compliant setups through on-premise or private cloud environments. In a market where speech features are becoming a core part of agent stacks, Mistral is clearly trying to compete on latency, openness, and operating cost all at once.
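The published per-minute prices make cost estimation straightforward; the arithmetic below uses only the figures stated above.

```python
# Back-of-the-envelope costs from the listed API prices (USD per minute).
REALTIME_PER_MIN = 0.006   # Voxtral Realtime
BATCH_PER_MIN = 0.003      # Voxtral Mini Transcribe V2

def cost_usd(minutes: float, per_min: float) -> float:
    """Cost in USD for a given number of audio minutes."""
    return round(minutes * per_min, 4)

# One hour of live captioning vs. the same hour transcribed in batch.
hour_realtime = cost_usd(60, REALTIME_PER_MIN)   # $0.36
hour_batch = cost_usd(60, BATCH_PER_MIN)         # $0.18

# A maximum-length single batch request (3 hours of audio).
max_request = cost_usd(180, BATCH_PER_MIN)       # $0.54
```

At these rates, batch transcription of a full day of call audio (24 hours) comes to about $4.32, which is the kind of margin the "price-performance workhorse" positioning rests on.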
Related Articles
A March 9, 2026 LocalLLaMA discussion highlighted Fish Audio’s S2 release, which combines fine-grained inline speech control, multilingual coverage, and an SGLang-based streaming stack.
Together AI said on March 12, 2026 that it is launching a one-cloud stack for real-time voice agents. Its public materials describe co-located STT, LLM, and TTS infrastructure with under-500ms latency, 25+ regions, and separate kernel work that cut time-to-first-64-tokens to 77ms in a voice-agent deployment.
OpenAI announced on X that Codex Security has entered research preview. The company positions it as an application security agent that can detect, validate, and patch complex vulnerabilities with more context and less noise.