Hacker News Resurfaces a Fully Local Home Assistant Voice Stack Built Around llama.cpp
Original: My Journey to a reliable and enjoyable locally hosted voice assistant (2025)
A practical local-first voice stack, not a lab demo
On March 16, 2026, Hacker News resurfaced a detailed Home Assistant community write-up that reached 310 points and 92 comments at crawl time. The value of the post is that it does not stop at saying local AI is possible. It documents a complete voice stack, the tradeoffs behind each component, and the latency the author actually saw on different hardware. The original forum post was published on October 27, 2025, but the HN discussion gave it fresh visibility.
The author describes moving away from Google Home and Nest Mini toward a fully local Home Assistant Assist setup. The voice server runs on a Beelink MiniPC with a USB4 eGPU enclosure, and the post lists results across GPUs from an RTX 3050 to an RTX 3090 and RX 7900XTX. In the posted table, the RTX 3090 24GB and RX 7900XTX 24GB can handle 20B to 30B MoE or 9B dense models with 1 to 2 second response times after prompt caching. Even an RTX 5060Ti 16GB is described as landing around 1.5 to 3 seconds. That moves the conversation from hobbyist curiosity to something close to household-grade usability.
The software stack is equally specific. The write-up recommends llama.cpp as the model runner, pairs Wyoming ONNX ASR with Nvidia Parakeet V2 through an OpenVINO branch for roughly 0.3-second CPU speech recognition, and compares Kokoro TTS with Piper for voice output. On the Home Assistant side, the author leans on LLM Conversation and llm-intents. But the strongest lesson in the post is that model choice alone did not make the system good: prompt design and tool routing mattered more. The author notes that weather, place lookup, search, and music control each needed explicit prompt sections with output examples, and that extra prompt work was required to strip emoji and unwanted chatter from spoken answers.
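Stripping emoji and chatter from spoken answers is as much a post-processing concern as a prompting one, since models can slip decorative characters past even a careful system prompt. A minimal Python sketch of such a cleanup pass before text reaches the TTS engine (the function name and regex ranges are illustrative, not from the original post) might look like:

```python
import re

# Hypothetical cleanup pass for LLM answers before they reach TTS.
# The emoji ranges are approximate; extend them for a real deployment.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, emoji
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator flags
    "\uFE00-\uFE0F"          # variation selectors left behind by emoji
    "]+"
)

def clean_for_tts(text: str) -> str:
    """Strip emoji and markdown chatter so the TTS engine reads plain prose."""
    text = EMOJI_RE.sub("", text)
    text = re.sub(r"[*_`#]+", "", text)  # drop markdown emphasis/heading marks
    text = re.sub(r"\s{2,}", " ", text)  # collapse leftover double spaces
    return text.strip()

print(clean_for_tts("Sure! ☀️ It's **sunny**, 21°C today 😊"))
# → Sure! It's sunny, 21°C today
```

A pass like this sits naturally between the conversation agent and the TTS step, keeping the prompt focused on content rather than formatting discipline.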
Another useful detail is the pragmatic use of automations instead of forcing the model to do everything. Music playback, for example, was routed through a sentence trigger such as <code>Play {music}</code> tied directly to Music Assistant. The wake word was also customized: the household settled on "Hey Robot," and the author trained a microWakeWord model for it. The motivation started with privacy and avoiding cloud outages, but the conclusion is operational: the local setup became more enjoyable and more reliable for the core tasks that mattered.
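The sentence-trigger approach works because a pattern like Play {music} compiles down to a matcher that captures the slot and hands it straight to an automation, bypassing the LLM entirely. As a rough Python illustration of the idea (a simplified sketch, not Home Assistant's actual matcher, which also handles alternatives, optional words, and per-language grammars):

```python
import re

def compile_sentence(pattern: str) -> re.Pattern:
    """Turn a 'Play {music}' style sentence into a regex with named slots.

    Illustrative only: each {slot} becomes a named capture group and
    everything else is matched literally, case-insensitively.
    """
    parts = re.split(r"(\{\w+\})", pattern)
    regex = ""
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            regex += f"(?P<{part[1:-1]}>.+)"  # named slot, greedy free text
        else:
            regex += re.escape(part)          # literal words between slots
    return re.compile(rf"^{regex}$", re.IGNORECASE)

matcher = compile_sentence("Play {music}")
m = matcher.match("play some jazz by Mingus")
if m:
    print(m.group("music"))  # the captured slot would go to Music Assistant
```

The point of the design is that a deterministic match like this is both faster and more reliable than asking the model to emit a tool call for a task with a fixed verb and one free-text argument.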
That explains why the HN thread landed. This is one of the clearer public blueprints showing that local voice assistants can now be assembled from commodity components, careful prompt work, and a willingness to automate around weak spots instead of pretending they do not exist.
Primary source: Home Assistant community post. Community discussion: Hacker News.
Related Articles
A Launch HN thread pushed RunAnywhere's RCLI into view as an Apple Silicon-first macOS voice AI stack that combines STT, LLM, TTS, local RAG, and 38 system actions without relying on cloud APIs.
CanIRun.ai runs entirely in the browser, detects GPU, CPU, and RAM through WebGL, WebGPU, and navigator APIs, and estimates which quantized models fit your machine. HN readers liked the idea but immediately pushed on missing hardware entries, calibration, and reverse-lookup features.
A new llama.cpp change turns <code>--reasoning-budget</code> into a real sampler-side limit instead of a template stub. The LocalLLaMA thread focused on the tradeoff between cutting long think loops and preserving answer quality, especially for local Qwen 3.5 deployments.