LocalLLaMA developer shares Whisper silence hallucination fixes from production logs
Original post: "We collected 135 phrases Whisper hallucinates during silence — here's what it says when nobody's talking and how we stopped it"
A high-engagement post in r/LocalLLaMA describes a practical failure mode many teams see only after deployment: Whisper generating fluent text when there is no speech. The author says they observed the issue across thousands of production meeting-audio hours and published a blocklist of recurring outputs.
The post reports 135 recurring English phrases, including common outro-like strings such as "Thanks for watching" and repeated loop patterns that can continue for long spans. The author argues this is a decoder behavior, not random garbage text: when audio is silent, Whisper can still produce likely completions from its training distribution.
The mitigation stack shared in the post is operationally specific (a code sketch follows the list):
- Silero VAD pre-gating so non-speech audio is filtered before Whisper runs (threshold 0.5, stop after 3 consecutive non-voice frames).
- Setting condition_on_previous_text=False to stop error carryover between windows.
- Using exact-string blocklists by language for known recurring hallucinations.
- Detecting repeated outputs and force-advancing timestamps on loop patterns.
- Using beam_size=1 so silence failures terminate faster than wider beam search.
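The post does not include code, but the first, second, and fifth items map directly onto off-the-shelf Python APIs. Below is a minimal sketch assuming openai-whisper and the Silero VAD model from torch.hub; the 512-sample frame size, the meeting.wav file name, and the gating loop itself are illustrative assumptions, not the author's production implementation.

```python
# Sketch: Silero VAD pre-gating + conservative Whisper decoding.
# Frame size, file name, and the gating loop are illustrative assumptions.
import torch
import whisper

SAMPLE_RATE = 16000
FRAME_SAMPLES = 512          # Silero VAD expects 512-sample frames at 16 kHz
VAD_THRESHOLD = 0.5          # speech-probability threshold reported in the post
MAX_NON_VOICE_FRAMES = 3     # stop after 3 consecutive non-voice frames

# utils is (get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks)
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
read_audio = vad_utils[2]

def gate_silence(wav: torch.Tensor) -> torch.Tensor:
    """Keep audio until 3 consecutive frames fall below the VAD threshold."""
    kept, non_voice = [], 0
    for start in range(0, len(wav) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        frame = wav[start:start + FRAME_SAMPLES]
        speech_prob = vad_model(frame, SAMPLE_RATE).item()
        non_voice = 0 if speech_prob >= VAD_THRESHOLD else non_voice + 1
        if non_voice >= MAX_NON_VOICE_FRAMES:
            break  # early stop: don't hand trailing silence to Whisper
        kept.append(frame)
    return torch.cat(kept) if kept else torch.empty(0)

wav = read_audio("meeting.wav", sampling_rate=SAMPLE_RATE)
speech = gate_silence(wav)
if speech.numel() == 0:
    text = ""  # no speech detected: emit nothing instead of letting Whisper guess
else:
    model = whisper.load_model("base")
    result = model.transcribe(
        speech,
        condition_on_previous_text=False,  # no carryover between 30 s windows
        beam_size=1,                       # greedy decoding; fails fast on silence
    )
    text = result["text"]
print(text)
```

In a live pipeline the gating would run per streaming chunk rather than once per file; the sketch keeps only leading speech to show the early-stop behavior the post describes.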
The author also cites the FAccT 2024 "Careless Whisper" paper and highlights safety risk in domains like medical transcription, where false fluent text can be more dangerous than blanks. The linked repository includes a publicly shared hallucination list (hallucinations/en.txt), which currently shows 134 text lines plus metadata headers in the raw file.
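The remaining two guardrails from the list are plain post-processing over Whisper's segment output. The sketch below assumes hallucinations/en.txt holds one phrase per line with '#' metadata headers (an assumption about the file layout) and stands in a simple consecutive-repeat counter for the post's loop detector; the author's actual repetition and timestamp-advancing logic may differ.

```python
# Sketch: exact-string blocklisting plus naive loop detection over Whisper
# segments. The en.txt layout and the repeat threshold are assumptions.
from pathlib import Path

def load_blocklist(path: str) -> set[str]:
    """One known hallucination per line; skip blanks and '#' headers (assumed format)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {ln.strip().lower() for ln in lines if ln.strip() and not ln.startswith("#")}

BLOCKLIST = load_blocklist("hallucinations/en.txt")
MAX_REPEATS = 2  # assumed threshold: drop a segment repeated more than twice in a row

def filter_segments(segments: list[dict]) -> list[dict]:
    """Drop exact blocklist hits and runs of identical consecutive segments."""
    kept, last_text, repeats = [], None, 0
    for seg in segments:
        norm = seg["text"].strip().lower()
        if norm in BLOCKLIST:
            continue  # known silence hallucination, e.g. "thanks for watching"
        repeats = repeats + 1 if norm == last_text else 0
        last_text = norm
        if repeats >= MAX_REPEATS:
            continue  # likely decoder loop; suppress instead of emitting it again
        kept.append(seg)
    return kept
```

Applied to result["segments"] from openai-whisper, this drops both one-shot outro strings and sustained loops; the post's force-advance step would additionally bump the decode window past the looping span rather than only suppressing the text.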
This is community-reported evidence rather than a controlled benchmark, but the post is useful because it translates a known model behavior into concrete production guardrails that teams can immediately test.
Community source: r/LocalLLaMA post
Referenced repo: Vexa (open-source)