Hacker News Resurfaces a Fully Local Home Assistant Voice Stack Built Around llama.cpp
Original: My Journey to a reliable and enjoyable locally hosted voice assistant (2025)
A practical local-first voice stack, not a lab demo
On March 16, 2026, Hacker News resurfaced a detailed Home Assistant community write-up that reached 310 points and 92 comments at crawl time. The value of the post is that it does not stop at saying local AI is possible. It documents a complete voice stack, the tradeoffs behind each component, and the latency the author actually saw on different hardware. The original forum post was published on October 27, 2025, but the HN discussion gave it fresh visibility.
The author describes moving away from Google Home and Nest Mini toward a fully local Home Assistant Assist setup. The voice server runs on a Beelink MiniPC with a USB4 eGPU enclosure, and the post lists results across GPUs from an RTX 3050 to an RTX 3090 and RX 7900XTX. In the posted table, the RTX 3090 24GB and RX 7900XTX 24GB can handle 20B to 30B MoE or 9B dense models with 1 to 2 second response times after prompt caching. Even an RTX 5060Ti 16GB is described as landing around 1.5 to 3 seconds. That moves the conversation from hobbyist curiosity to something close to household-grade usability.
The software stack is equally specific. The write-up recommends llama.cpp as the model runner, pairs Wyoming ONNX ASR with Nvidia Parakeet V2 through an OpenVINO branch for roughly 0.3-second CPU speech recognition, and compares Kokoro TTS with Piper for voice output. On the Home Assistant side, the author leans on LLM Conversation and llm-intents. But the strongest lesson in the post is that model choice alone did not make the system good: prompt design and tool routing mattered more. The author notes that weather, place lookup, search, and music control each needed explicit prompt sections with output examples, and that extra prompt work was required to strip emoji and unwanted chatter from spoken answers.
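Stripping emoji and chatter from spoken answers is as much a post-processing concern as a prompting one, since models can slip decorative characters past even a careful system prompt. A minimal Python sketch of such a cleanup pass before text reaches the TTS engine (the function name and regex ranges are illustrative, not from the original post) might look like:

```python
import re

# Hypothetical cleanup pass for LLM answers before they reach TTS.
# The emoji ranges are approximate; extend them for a real deployment.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, emoji
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator flags
    "\uFE00-\uFE0F"          # variation selectors left behind by emoji
    "]+"
)

def clean_for_tts(text: str) -> str:
    """Strip emoji and markdown chatter so the TTS engine reads plain prose."""
    text = EMOJI_RE.sub("", text)
    text = re.sub(r"[*_`#]+", "", text)  # drop markdown emphasis/heading marks
    text = re.sub(r"\s{2,}", " ", text)  # collapse leftover double spaces
    return text.strip()

print(clean_for_tts("Sure! ☀️ It's **sunny**, 21°C today 😊"))
# → Sure! It's sunny, 21°C today
```

A pass like this sits naturally between the conversation agent and the TTS step, keeping the prompt focused on content rather than formatting discipline.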
Another useful detail is the pragmatic use of automations instead of forcing the model to do everything. Music playback, for example, was routed through a sentence trigger such as <code>Play {music}</code> tied directly to Music Assistant. The wake word was also customized: the household settled on "Hey Robot," and the author trained a microWakeWord model for it. The motivation started with privacy and avoiding cloud outages, but the conclusion is operational: the local setup became more enjoyable and more reliable for the core tasks that mattered.
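The sentence-trigger approach works because a pattern like Play {music} compiles down to a matcher that captures the slot and hands it straight to an automation, bypassing the LLM entirely. As a rough Python illustration of the idea (a simplified sketch, not Home Assistant's actual matcher, which also handles alternatives, optional words, and per-language grammars):

```python
import re

def compile_sentence(pattern: str) -> re.Pattern:
    """Turn a 'Play {music}' style sentence into a regex with named slots.

    Illustrative only: each {slot} becomes a named capture group and
    everything else is matched literally, case-insensitively.
    """
    parts = re.split(r"(\{\w+\})", pattern)
    regex = ""
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            regex += f"(?P<{part[1:-1]}>.+)"  # named slot, greedy free text
        else:
            regex += re.escape(part)          # literal words between slots
    return re.compile(rf"^{regex}$", re.IGNORECASE)

matcher = compile_sentence("Play {music}")
m = matcher.match("play some jazz by Mingus")
if m:
    print(m.group("music"))  # the captured slot would go to Music Assistant
```

The point of the design is that a deterministic match like this is both faster and more reliable than asking the model to emit a tool call for a task with a fixed verb and one free-text argument.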
That explains why the HN thread landed. This is one of the clearer public blueprints showing that local voice assistants can now be assembled from commodity components, careful prompt work, and a willingness to automate around weak spots instead of pretending they do not exist.
Primary source: Home Assistant community post. Community discussion: Hacker News.
Related Articles
A Launch HN thread pushed RunAnywhere's RCLI into view as an Apple Silicon-first macOS voice AI stack that combines STT, LLM, TTS, local RAG, and 38 system actions without relying on cloud APIs.
CanIRun.ai runs entirely in the browser, detects GPU, CPU, and RAM through WebGL, WebGPU, and navigator APIs, and estimates which quantized models fit your machine. HN readers liked the idea but immediately pushed on missing hardware entries, calibration, and reverse-lookup features.
A new llama.cpp change turns <code>--reasoning-budget</code> into a real sampler-side limit instead of a template stub. The LocalLLaMA thread focused on the tradeoff between cutting long think loops and preserving answer quality, especially for local Qwen 3.5 deployments.