Hacker News Resurfaces a Fully Local Home Assistant Voice Stack Built Around llama.cpp
Original: My Journey to a reliable and enjoyable locally hosted voice assistant (2025)
A practical local-first voice stack, not a lab demo
On March 16, 2026, Hacker News resurfaced a detailed Home Assistant community write-up that reached 310 points and 92 comments at crawl time. The value of the post is that it does not stop at saying local AI is possible. It documents a complete voice stack, the tradeoffs behind each component, and the latency the author actually saw on different hardware. The original forum post was published on October 27, 2025, but the HN discussion gave it fresh visibility.
The author describes moving away from Google Home and Nest Mini toward a fully local Home Assistant Assist setup. The voice server runs on a Beelink mini PC with a USB4 eGPU enclosure, and the post lists results across GPUs from an RTX 3050 to an RTX 3090 and RX 7900 XTX. In the posted table, the 24 GB cards (RTX 3090 and RX 7900 XTX) handle 20B-to-30B MoE or 9B dense models with 1-to-2-second response times after prompt caching. Even an RTX 5060 Ti with 16 GB is described as landing around 1.5 to 3 seconds. That moves the conversation from hobbyist curiosity to something close to household-grade usability.
The software stack is equally specific. The write-up recommends llama.cpp as the model runner, pairs Wyoming ONNX ASR with Nvidia Parakeet V2 through an OpenVINO branch for roughly 0.3-second CPU speech recognition, and compares Kokoro TTS with Piper for voice output. On the Home Assistant side, the author leans on LLM Conversation and llm-intents. But the strongest lesson in the post is that model choice alone did not make the system good: prompt design and tool routing mattered more. The author says weather, place lookup, search, and music control each needed explicit sections and output examples, and extra prompt work was required to strip emoji and unwanted chatter from spoken answers.
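The post does not reproduce the author's actual prompt, but the pattern it describes (one explicit section per tool, concrete output examples, and rules against emoji or filler in spoken replies) can be sketched as a hypothetical system prompt. The section names and wording below are illustrative, not the author's:

```
You are a home voice assistant. Your answers are spoken aloud:
keep them to one or two sentences, plain text only, no emoji,
no markdown, no filler like "Sure!" or "Here you go".

## Weather
Use the weather tool for any forecast question.
Example answer: "Tomorrow is sunny with a high of 18 degrees."

## Place lookup
Use the place tool for addresses and opening hours.
Example answer: "The nearest pharmacy closes at 9 pm."

## Search
Use the search tool only when the answer is not in home state.

## Music
Do not handle music yourself; playback is routed to an automation.
```

The per-tool sections and examples are what the author credits with making tool routing reliable; the anti-chatter rules address the emoji and small-talk problem called out in the post.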
Another useful detail is the pragmatic use of automations instead of forcing the model to do everything. Music playback, for example, was routed through a sentence trigger such as "Play {music}" tied directly to Music Assistant. The wake word was also customized: the household settled on "Hey Robot", and the author trained a microWakeWord model for it. The motivation started with privacy and avoiding cloud outages, but the conclusion is operational: the local setup became more enjoyable and more reliable for the core tasks that mattered.
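In Home Assistant, that kind of routing can be wired up with a conversation sentence trigger. A minimal sketch, assuming the Music Assistant integration is installed; the entity name is a placeholder, and the author's actual automation is not shown in the post:

```yaml
automation:
  - alias: "Assist: play music via Music Assistant"
    trigger:
      - platform: conversation
        command: "Play {music}"   # {music} captures whatever follows "Play"
    action:
      - service: music_assistant.play_media
        target:
          entity_id: media_player.living_room   # placeholder player entity
        data:
          media_id: "{{ trigger.slots.music }}" # the captured slot value
```

Because the sentence trigger matches before the LLM is consulted, playback requests bypass the model entirely, which is exactly the "automate around weak spots" approach the author advocates.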
That explains why the HN thread landed. This is one of the clearer public blueprints showing that local voice assistants can now be assembled from commodity components, careful prompt work, and a willingness to automate around weak spots instead of pretending they do not exist.
Primary source: Home Assistant community post. Community discussion: Hacker News.