Parlor Shows Real-Time On-Device Multimodal Voice AI on Apple Silicon

Original: Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

AI · Apr 7, 2026 · By Insights AI (HN) · 2 min read

A recent Show HN thread pointed to Parlor, an on-device multimodal assistant that takes microphone audio and camera frames in the browser and replies with synthesized speech, all without a cloud API in the loop. The project uses Gemma 4 E2B for speech-and-vision understanding and Kokoro for text-to-speech.

The architecture is straightforward and practical. PCM audio and JPEG video frames stream over a WebSocket to a FastAPI server. Gemma runs via LiteRT-LM on the GPU, Kokoro handles speech generation, and the browser plays back streamed audio while showing the transcript. The repo also calls out browser-side voice activity detection, barge-in so the user can interrupt mid-sentence, and sentence-level TTS streaming so playback can start before the full answer is complete.
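The repo's exact wire format is not documented in the post, but the ingest side of such a server reduces to routing two binary streams. A minimal sketch, assuming a hypothetical one-byte tag prefix on each WebSocket message (the real Parlor protocol may differ):

```python
# Hypothetical framing: each binary WebSocket message starts with a
# one-byte tag identifying the payload type. This is an illustrative
# assumption, not Parlor's documented protocol.
AUDIO_FRAME = 0x01   # chunk of 16-bit PCM from the browser microphone
VIDEO_FRAME = 0x02   # one JPEG-encoded camera frame

def dispatch(message: bytes, state: dict) -> str:
    """Route one incoming message into the per-connection state.

    Audio chunks are appended to a rolling PCM buffer; for video only
    the most recent frame is kept, since the model samples frames
    rather than consuming full-rate video.
    """
    tag, payload = message[0], message[1:]
    if tag == AUDIO_FRAME:
        state.setdefault("pcm", bytearray()).extend(payload)
        return "audio"
    if tag == VIDEO_FRAME:
        state["last_frame"] = payload
        return "video"
    raise ValueError(f"unknown message tag {tag:#x}")
```

In a FastAPI handler this would sit inside a `websocket.receive_bytes()` loop, with the accumulated PCM handed to voice-activity detection and, on end of speech, to the model together with the latest frame.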

The hardware bar is lower than many people expect from real-time multimodal AI. The README lists Apple Silicon macOS or Linux with a supported GPU, Python 3.12+, and roughly 3 GB of free RAM. On first run the app downloads about 2.6 GB for Gemma 4 E2B plus the TTS models. The author describes the project as a research preview and notes that a few months earlier similar real-time behavior would have required a far larger GPU budget.

Why it matters

Parlor is interesting because it packages several UX behaviors people usually associate with hosted assistants into a local stack that developers can inspect and run themselves. The published benchmark on an Apple M3 Pro reports about 1.8 to 2.2 seconds for speech-and-vision understanding, about 0.3 seconds for a short text response, and roughly 0.3 to 0.7 seconds for speech synthesis, with total end-to-end latency around 2.5 to 3.0 seconds.
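The per-stage numbers roughly account for the end-to-end figure. A quick check, summing the reported ranges:

```python
# Per-stage latency ranges (seconds) as published for the Apple M3 Pro.
stages = {
    "understanding (audio + vision)": (1.8, 2.2),
    "short text response":            (0.3, 0.3),
    "speech synthesis":               (0.3, 0.7),
}

# Best case: every stage hits its low end; worst case: its high end.
low = sum(lo for lo, hi in stages.values())
high = sum(hi for lo, hi in stages.values())
print(f"end-to-end budget: {low:.1f}-{high:.1f} s")
```

The summed range of 2.4 to 3.2 seconds brackets the reported 2.5 to 3.0 second total, which is consistent with the stages running mostly sequentially, with sentence-level streaming shaving a little off the tail.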

  • Understanding model: Gemma 4 E2B via LiteRT-LM.
  • Speech model: Kokoro, with MLX on Mac and ONNX on Linux.
  • Claimed decode speed on Apple M3 Pro: roughly 83 tokens per second.
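The sentence-level TTS streaming mentioned above is the trick that makes a multi-second pipeline feel responsive: speech synthesis starts on the first complete sentence while the model is still decoding the rest. An illustrative sketch of that chunking step (Parlor's actual logic may differ):

```python
import re

def sentence_stream(token_chunks):
    """Yield complete sentences as soon as they appear in a streaming
    text response, so TTS can begin speaking before generation ends.

    `token_chunks` is any iterable of text fragments, e.g. tokens
    streamed out of the language model's decoder.
    """
    buf = ""
    for chunk in token_chunks:
        buf += chunk
        # A sentence is complete once ., !, or ? is followed by whitespace.
        while (m := re.search(r"[.!?]\s+", buf)):
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():          # flush whatever trails the last terminator
        yield buf.strip()
```

At the claimed ~83 tokens per second of decode, a one-sentence opener is ready for synthesis well under a second into generation, which is how playback can overlap the rest of the response.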

The larger takeaway is that local multimodal voice interfaces are moving from demo territory into reproducible developer projects. Parlor is still early, but it is a concrete example of how quickly laptop-scale AI stacks are improving.




© 2026 Insights. All rights reserved.