Hacker News Resurfaces a Fully Local Home Assistant Voice Stack Built Around llama.cpp

Original: My Journey to a reliable and enjoyable locally hosted voice assistant (2025)

LLM · Mar 17, 2026 · By Insights AI (HN) · 2 min read

A practical local-first voice stack, not a lab demo

On March 16, 2026, Hacker News resurfaced a detailed Home Assistant community write-up that reached 310 points and 92 comments at crawl time. The value of the post is that it does not stop at saying local AI is possible. It documents a complete voice stack, the tradeoffs behind each component, and the latency the author actually saw on different hardware. The original forum post was published on October 27, 2025, but the HN discussion gave it fresh visibility.

The author describes moving away from Google Home and Nest Mini toward a fully local Home Assistant Assist setup. The voice server runs on a Beelink MiniPC with a USB4 eGPU enclosure, and the post lists results across GPUs from an RTX 3050 to an RTX 3090 and RX 7900XTX. In the posted table, the RTX 3090 24GB and RX 7900XTX 24GB can handle 20B to 30B MoE or 9B dense models with 1 to 2 second response times after prompt caching. Even an RTX 5060Ti 16GB is described as landing around 1.5 to 3 seconds. That moves the conversation from hobbyist curiosity to something close to household-grade usability.

The software stack is equally specific. The write-up recommends llama.cpp as the model runner, pairs Wyoming ONNX ASR with Nvidia Parakeet V2 through an OpenVINO branch for roughly 0.3-second speech recognition on CPU, and compares Kokoro TTS with Piper for voice output. On the Home Assistant side, the author leans on LLM Conversation and llm-intents. But the strongest lesson in the post is that model choice alone did not make the system good: prompt design and tool routing mattered more. The author says weather, place lookup, search, and music control each needed explicit prompt sections with output examples, and extra prompt work was required to strip emoji and unwanted chatter from spoken answers.
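The post does not reproduce the author's actual prompt, so the following is only a hedged illustration of the pattern described: a system prompt with an explicit section and output example per tool, plus a belt-and-braces post-filter for spoken answers. The section names, example sentences, and emoji ranges below are assumptions, not the author's wording.

```python
import re

# Hypothetical tool-sectioned system prompt in the spirit of the post:
# each capability gets its own section with an example answer, and the
# model is told up front to produce plain, speakable text.
SYSTEM_PROMPT = """You are a home voice assistant. Answer in one short
spoken sentence. No emoji, no markdown, no follow-up questions.

## weather
Use the weather tool for forecasts.
Example answer: It will be 12 degrees and cloudy this afternoon.

## music
Use the music tool to start playback.
Example answer: Playing Miles Davis in the living room.
"""

# Even with prompt instructions, small local models occasionally emit
# emoji or markdown; strip both before handing text to TTS.
_EMOJI = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF]"
)

def clean_for_tts(text: str) -> str:
    text = _EMOJI.sub("", text)           # drop common emoji code points
    text = re.sub(r"[*_`#]+", "", text)   # drop markdown decoration
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_for_tts("Sure! **Playing** jazz 🎶")` yields `"Sure! Playing jazz"`, which reads naturally through Piper or Kokoro.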

Another useful detail is the pragmatic use of automations instead of forcing the model to do everything. Music playback, for example, was routed through a sentence trigger such as "Play {music}" tied directly to Music Assistant. The wake word was also customized: the household settled on "Hey Robot", and the author trained a microWakeWord model for it. The motivation started with privacy and avoiding cloud outages, but the conclusion is operational: the local setup became more enjoyable and more reliable for the core tasks that mattered.
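The post does not include the automation itself; a minimal Home Assistant sketch of the pattern it describes, a sentence trigger wired straight to Music Assistant, might look like the fragment below. The alias and the media player entity ID are placeholder assumptions, not values from the original write-up.

```yaml
automation:
  - alias: "Voice: play music via Music Assistant"
    trigger:
      - platform: conversation
        command: "Play {music}"   # {music} is captured as a wildcard slot
    action:
      - service: music_assistant.play_media
        target:
          entity_id: media_player.living_room  # placeholder entity
        data:
          media_id: "{{ trigger.slots.music }}"
```

Routing this one intent around the LLM entirely is the point: playback stays fast and deterministic, and the model is reserved for requests that actually need reasoning.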

That explains why the HN thread landed. This is one of the clearer public blueprints showing that local voice assistants can now be assembled from commodity components, careful prompt work, and a willingness to automate around weak spots instead of pretending they do not exist.

Primary source: Home Assistant community post. Community discussion: Hacker News.




© 2026 Insights. All rights reserved.