Why it matters: xAI has turned the Grok Voice stack into standalone STT/TTS APIs with batch transcription at $0.10/hour and streaming at $0.20/hour. The post puts 25+ languages, diarization, and word-level timestamps in direct competition with enterprise transcription tools.
#speech-to-text
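To put the quoted xAI rates in perspective, here is a rough cost sketch in Python. It assumes billing scales linearly with audio duration, with no minimum charge, rounding increments, or volume discounts; the post does not confirm any of that. The function name `transcription_cost` and the 1,000-hour volume are illustrative only.

```python
# Rates quoted in the post, in USD per hour of audio.
BATCH_RATE_PER_HOUR = 0.10
STREAMING_RATE_PER_HOUR = 0.20


def transcription_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimate the cost in USD for a given number of audio hours.

    Assumption: cost is strictly proportional to duration.
    """
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    rate = STREAMING_RATE_PER_HOUR if streaming else BATCH_RATE_PER_HOUR
    return round(audio_hours * rate, 2)


if __name__ == "__main__":
    hours = 1_000  # hypothetical monthly volume
    print(f"batch:     ${transcription_cost(hours):.2f}")
    print(f"streaming: ${transcription_cost(hours, streaming=True):.2f}")
```

Under those assumptions, streaming costs exactly twice as much as batch for the same audio, so the choice comes down to whether latency matters for the workload.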
LocalLLaMA jumped on this because native audio in llama-server promises a much cleaner speech workflow for local AI. The first wave of comments loves the idea of dropping the extra Whisper service, but it is also documenting where long-form audio still breaks.
The LocalLLaMA thread took off because native speech-to-text inside llama.cpp is exactly the kind of feature that removes an extra pipeline from local agent setups. The post says llama-server can now run STT with Gemma-4 E2A and E4A models, and commenters immediately started comparing the practical experience to Whisper and Voxtral.
A 440-point Show HN thread put Ghost Pepper, a menu-bar macOS app that records on Control-hold and transcribes locally, into the agent-tooling conversation because its speech and cleanup stack stays on-device.