LocalLLaMA developer shares Whisper silence hallucination fixes from production logs
Original: "We collected 135 phrases Whisper hallucinates during silence — here's what it says when nobody's talking and how we stopped it"
A high-engagement post in r/LocalLLaMA describes a practical failure mode many teams see only after deployment: Whisper generating fluent text when there is no speech. The author says they observed the issue across thousands of production meeting-audio hours and published a blocklist of recurring outputs.
The post reports 135 recurring English phrases, including common outro-like strings such as "Thanks for watching" and repeated loop patterns that can continue for long spans. The author argues this is a decoder behavior, not random garbage text: when audio is silent, Whisper can still produce likely completions from its training distribution.
The mitigation stack shared in the post is operationally specific:
- Silero VAD pre-gating so non-speech audio is filtered before Whisper runs (threshold 0.5, stop after 3 consecutive non-voice frames).
- Setting condition_on_previous_text=False to stop error carryover between windows.
- Using exact-string blocklists by language for known recurring hallucinations.
- Detecting repeated outputs and force-advancing timestamps on loop patterns.
- Using beam_size=1 so silence failures terminate faster than wider beam search.
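The VAD pre-gating step can be sketched in plain Python. This is a minimal illustration of the behavior described in the post, not its actual code: per-frame speech probabilities (as a Silero-style VAD emits) are scanned, a speech region opens on the first frame at or above the 0.5 threshold, and it closes once 3 consecutive frames fall below it. Silent-only audio yields no regions and therefore never reaches Whisper. The function name speech_regions and the (start, end) frame-index output format are assumptions for this sketch.

```python
def speech_regions(probs, threshold=0.5, patience=3):
    """Return (start, end) frame-index pairs of voiced regions.

    probs: per-frame voice probabilities from a VAD (hypothetical input
    format; Silero's actual API differs). A region ends after `patience`
    consecutive frames below `threshold`; `end` is exclusive.
    """
    regions = []
    start, miss = None, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i          # open a new voiced region
            miss = 0
        elif start is not None:
            miss += 1
            if miss >= patience:   # too many silent frames: close region
                regions.append((start, i - miss + 1))
                start, miss = None, 0
    if start is not None:          # region still open at end of stream
        regions.append((start, len(probs) - miss))
    return regions

# Silence in, nothing out: no audio is forwarded to the ASR model.
print(speech_regions([0.1] * 5))  # → []
```

Only the audio inside returned regions would then be passed to Whisper, which is what removes the silent spans where hallucinations originate.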
The author also cites the FAccT 2024 "Careless Whisper" paper and highlights safety risk in domains like medical transcription, where false fluent text can be more dangerous than blanks. The linked repository includes a publicly shared hallucination list (hallucinations/en.txt), which currently shows 134 text lines plus metadata headers in the raw file.
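The blocklist and loop-detection guardrails can likewise be sketched as a post-processing pass. This is an assumed implementation, not code from the linked repository: decoder segments are matched exactly (after lowercasing and whitespace stripping) against a per-language blocklist like hallucinations/en.txt, and runs of identical consecutive segments beyond a repeat limit are treated as a decoder loop. The post force-advances timestamps on loops; for brevity this sketch simply drops the excess repeats. filter_segments and the (start, end, text) tuple format are hypothetical names.

```python
def filter_segments(segments, blocklist, max_repeats=2):
    """Drop known-hallucination segments and collapse decoder loops.

    segments: list of (start, end, text) tuples from the decoder.
    blocklist: set of known hallucinated phrases, lowercase.
    Consecutive identical texts beyond `max_repeats` are discarded.
    """
    out = []
    prev, run = None, 0
    for start, end, text in segments:
        norm = text.strip().lower()
        if norm in blocklist:      # exact-string hallucination hit
            continue
        run = run + 1 if norm == prev else 1
        prev = norm
        if run > max_repeats:      # loop pattern: drop further repeats
            continue
        out.append((start, end, text))
    return out
```

In production the blocklist would be loaded from the published list file and the repeat limit tuned per workload; exact matching keeps false positives low at the cost of missing near-duplicates.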
This is community-reported evidence rather than a controlled benchmark, but the post is useful because it translates a known model behavior into concrete production guardrails that teams can immediately test.
Community source: r/LocalLLaMA post
Referenced repo: Vexa (open-source)