Mistral expands its speech stack with Voxtral Realtime and Voxtral Mini Transcribe V2
Original: Voxtral transcribes at the speed of sound.
Mistral has expanded its speech product line with two closely linked releases: Voxtral Realtime for low-latency streaming transcription and Voxtral Mini Transcribe V2 for high-efficiency batch transcription. Together, the two releases give Mistral a more complete voice stack spanning live voice agents, subtitle generation, meeting transcription, and post-call processing. The company also introduced a new audio playground in Mistral Studio, signaling that this is intended as a developer platform update rather than just a research drop.
Voxtral Realtime is the more strategically interesting of the two models because it targets applications where delay directly shapes user experience. Mistral says the model is purpose-built for streaming audio rather than an offline model adapted to chunked input, and that latency can be configured down to sub-200 ms. At 2.4 seconds of delay, the company says Realtime matches the accuracy of Voxtral Mini Transcribe V2, and at 480 ms it stays within 1-2% word error rate of that mark. Mistral also says the model supports 13 languages, runs with a 4B parameter footprint, and ships under Apache 2.0 on Hugging Face, which makes it unusually usable for privacy-sensitive and edge deployments.
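To make the latency knob concrete: a streaming client typically slices captured audio into fixed-duration chunks sized to the latency budget before sending them over the wire. The sketch below is a generic client-side illustration of that tradeoff, not Mistral's actual streaming protocol or SDK; the sample rate and chunking scheme are assumptions.

```python
def chunk_pcm(samples: bytes, sample_rate: int, delay_s: float,
              bytes_per_sample: int = 2) -> list[bytes]:
    """Slice a mono PCM byte stream into fixed-duration chunks for a
    streaming transcriber. delay_s is the configured latency budget
    (the article cites settings from sub-200 ms up to 2.4 s); smaller
    chunks mean faster partial results but less context per request."""
    chunk_bytes = int(sample_rate * delay_s) * bytes_per_sample
    return [samples[i:i + chunk_bytes]
            for i in range(0, len(samples), chunk_bytes)]

# One second of silent 16 kHz, 16-bit audio split at a 480 ms setting
# yields two full chunks plus a short tail.
audio = bytes(16000 * 2)
chunks = chunk_pcm(audio, sample_rate=16000, delay_s=0.48)
```

Dropping `delay_s` toward 0.2 shrinks each chunk proportionally, which is the tradeoff the 480 ms vs. 2.4 s accuracy figures describe.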
Voxtral Mini Transcribe V2 is positioned as the price-performance workhorse. Mistral reports roughly 4% word error rate on the FLEURS benchmark at $0.003/min and claims it outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova on accuracy. The company also says it processes audio about 3x faster than ElevenLabs Scribe v2 while matching quality at one-fifth the cost. Feature-wise, the batch model includes speaker diarization, context biasing for up to 100 words or phrases, word-level timestamps, 13-language support, noise robustness, and up to 3 hours of audio in a single request. Those are the practical capabilities developers need for production meeting notes, call analytics, and multimedia indexing.
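The batch features listed above map naturally onto per-request options. The sketch below assembles an illustrative request payload; the endpoint path, model id, and field names are assumptions inferred from the feature list, not confirmed API parameters.

```python
def build_transcribe_fields(bias_terms: list[str]) -> dict:
    """Assemble form fields for a hypothetical batch transcription call.
    All field names and the model id below are illustrative guesses."""
    return {
        "model": "voxtral-mini-transcribe-v2",   # hypothetical model id
        "diarize": "true",                       # speaker diarization
        "timestamp_granularity": "word",         # word-level timestamps
        # The article says up to 100 context-bias terms; truncate defensively.
        "context_bias": ",".join(bias_terms[:100]),
    }

fields = build_transcribe_fields(["Voxtral", "FLEURS", "diarization"])

# Sending the request would look roughly like this (requires the
# third-party `requests` package, an API key, and a real endpoint):
# import os, requests
# resp = requests.post(
#     "https://api.mistral.ai/v1/audio/transcriptions",  # path is an assumption
#     headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
#     data=fields,
#     files={"file": open("meeting.m4a", "rb")},  # up to 3 hours of audio
# )
```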
Mistral packaged the release with tooling that lowers experimentation cost. The audio playground in Mistral Studio supports up to 10 uploaded files, diarization toggles, timestamp granularity controls, and context bias terms, with support for .mp3, .wav, .m4a, .flac, and .ogg files up to 1GB each. API pricing is set at $0.006/min for Voxtral Realtime and $0.003/min for Voxtral Mini Transcribe V2. Mistral also says both models can be deployed in GDPR- and HIPAA-compliant setups through on-premise or private cloud environments. In a market where speech features are becoming a core part of agent stacks, Mistral is clearly trying to compete on latency, openness, and operating cost all at once.
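The published per-minute prices make cost estimation straightforward; the arithmetic below uses only the figures stated above.

```python
# Back-of-the-envelope costs from the listed API prices (USD per minute).
REALTIME_PER_MIN = 0.006   # Voxtral Realtime
BATCH_PER_MIN = 0.003      # Voxtral Mini Transcribe V2

def cost_usd(minutes: float, per_min: float) -> float:
    """Cost in USD for a given number of audio minutes."""
    return round(minutes * per_min, 4)

# One hour of live captioning vs. the same hour transcribed in batch.
hour_realtime = cost_usd(60, REALTIME_PER_MIN)   # $0.36
hour_batch = cost_usd(60, BATCH_PER_MIN)         # $0.18

# A maximum-length single batch request (3 hours of audio).
max_request = cost_usd(180, BATCH_PER_MIN)       # $0.54
```

At these rates, batch transcription of a full day of call audio (24 hours) comes to about $4.32, which is the kind of margin the "price-performance workhorse" positioning rests on.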
Related Articles
A March 9, 2026 LocalLLaMA discussion highlighted Fish Audio’s S2 release, which combines fine-grained inline speech control, multilingual coverage, and an SGLang-based streaming stack.
Together AI said on March 12, 2026 that it is launching a one-cloud stack for real-time voice agents. Its public materials describe co-located STT, LLM, and TTS infrastructure with under-500ms latency, 25+ regions, and separate kernel work that cut time-to-first-64-tokens to 77ms in a voice-agent deployment.
OpenAI announced on X that Codex Security has entered research preview. The company positions it as an application security agent that can detect, validate, and patch complex vulnerabilities with more context and less noise.