Show HN: Building a Sub-500ms Latency Voice Agent from Scratch
Original: Show HN: I built a sub-500ms latency voice agent from scratch
400ms Voice AI: What It Takes
Developer Nick Tikhonov shared a Show HN project (122 upvotes) detailing how he built a voice agent averaging ~400ms end-to-end latency, measured from the moment the caller stops speaking to the agent's first syllable. It runs a complete STT → LLM → TTS pipeline with clean barge-ins and no precomputed responses.
What Actually Moved the Needle
- Semantic End-of-Turn Detection: Voice activity detection (VAD) alone fails for natural conversation. You need semantic understanding of when someone is truly done speaking.
- Streaming is Non-Negotiable: Sequential pipelines are dead on arrival. STT, LLM, and TTS must all stream, so each stage starts working on partial output from the stage before it.
- TTFT Dominates: Groq's ~80ms time-to-first-token (TTFT) was the single biggest performance win.
- Geography Over Prompts: Colocating all components mattered more than any prompt optimization.
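To make the streaming point concrete, here is a minimal sketch that is not taken from the project. `fake_llm` and `fake_tts` are hypothetical stand-ins for real streaming services, and synthesizing one audio chunk per token is a simplifying assumption (a real TTS stage would more likely buffer to phrase or sentence boundaries). It only counts how many LLM tokens must be produced before the first audio chunk exists.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical canned response; a real LLM would generate this incrementally.
TOKENS = ["Sure,", " I", " can", " help", " with", " that."]


def fake_llm(prompt: str, log: List[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM: yields one token at a time."""
    for token in TOKENS:
        log.append(f"llm:{token}")
        yield token


def fake_tts(tokens: Iterable[str], log: List[str]) -> Iterator[bytes]:
    """Stand-in for streaming TTS: emits an 'audio chunk' per incoming token."""
    for token in tokens:
        log.append(f"tts:{token}")
        yield token.encode("utf-8")  # placeholder bytes, not real audio


def first_audio_streaming(prompt: str) -> Tuple[bytes, List[str]]:
    """Chain the stages lazily: TTS pulls tokens as the LLM produces them."""
    log: List[str] = []
    audio = fake_tts(fake_llm(prompt, log), log)
    return next(audio), log


def first_audio_sequential(prompt: str) -> Tuple[bytes, List[str]]:
    """Wait for the full LLM response before starting TTS."""
    log: List[str] = []
    full_response = list(fake_llm(prompt, log))  # blocks until the last token
    audio = fake_tts(full_response, log)
    return next(audio), log


def llm_tokens_before_first_audio(log: List[str]) -> int:
    """Count LLM tokens logged before the first TTS entry."""
    count = 0
    for entry in log:
        if entry.startswith("tts:"):
            break
        count += 1
    return count


if __name__ == "__main__":
    _, streaming_log = first_audio_streaming("hello")
    _, sequential_log = first_audio_sequential("hello")
    print(llm_tokens_before_first_audio(streaming_log))   # 1
    print(llm_tokens_before_first_audio(sequential_log))  # 6
```

In the streamed version the first audio chunk depends only on the first token, which is why time-to-first-token, rather than total generation time, sets the floor on perceived latency.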
The Core Loop
The system reduces to two states — speaking vs. listening — and two critical transitions: cancel instantly on barge-in, respond instantly on end-of-turn. These transitions define the entire user experience. Voice is fundamentally a turn-taking problem, not a transcription problem.
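The two-state loop can be sketched as a small state machine. This is an illustrative reading of the description above, not the project's code: the class, event, and action names are hypothetical, and the `PLAYBACK_DONE` event is an added assumption so the agent can return to listening after an uninterrupted response.

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class Event(Enum):
    END_OF_TURN = auto()    # user is judged to have finished speaking
    BARGE_IN = auto()       # user starts talking over the agent
    PLAYBACK_DONE = auto()  # assumption: agent finished without interruption


class TurnLoop:
    """Tracks whose turn it is and records the action each transition triggers."""

    def __init__(self) -> None:
        self.state = State.LISTENING
        self.actions: List[str] = []  # stand-in for real side effects

    def handle(self, event: Event) -> State:
        if self.state is State.LISTENING and event is Event.END_OF_TURN:
            # Critical transition 1: respond immediately on end-of-turn.
            self.actions.append("start_response")
            self.state = State.SPEAKING
        elif self.state is State.SPEAKING and event is Event.BARGE_IN:
            # Critical transition 2: cancel immediately on barge-in.
            self.actions.append("cancel_response")
            self.state = State.LISTENING
        elif self.state is State.SPEAKING and event is Event.PLAYBACK_DONE:
            self.state = State.LISTENING
        # Any other (state, event) pair is ignored in this sketch.
        return self.state


if __name__ == "__main__":
    loop = TurnLoop()
    loop.handle(Event.END_OF_TURN)  # agent starts responding
    loop.handle(Event.BARGE_IN)     # user interrupts, response is cancelled
    print(loop.state.name, loop.actions)
```

In a real system, `start_response` would kick off the streaming pipeline and `cancel_response` would have to stop LLM generation, TTS synthesis, and audio playback together, since any stage left running would keep talking over the user.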
Open Source
The project is available on GitHub as 'shuo'. For developers building real-time voice AI systems, this implementation offers a practical, battle-tested reference for achieving sub-500ms conversational latency.