
Parlor Shows Real-Time On-Device Multimodal Voice AI on Apple Silicon

Original: Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

AI Apr 7, 2026 By Insights AI (HN)

A recent Show HN thread pointed to Parlor, an on-device multimodal assistant that takes microphone audio and camera frames in the browser and replies with synthesized speech, all without a cloud API in the loop. The project uses Gemma 4 E2B for speech-and-vision understanding and Kokoro for text-to-speech.

The architecture is straightforward and practical. PCM audio and JPEG frames stream over a WebSocket to a FastAPI server. Gemma runs via LiteRT-LM on the GPU, Kokoro handles speech generation, and the browser plays back streamed audio while showing the transcript. The repo also calls out three interaction features: browser-side voice activity detection, barge-in so the user can interrupt mid-sentence, and sentence-level TTS streaming so playback can start before the full answer is complete.
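To make sentence-level TTS streaming concrete, here is a minimal sketch of the general idea, not Parlor's actual code. It buffers text pieces as a model emits them and yields each sentence as soon as it is complete, so a speech synthesizer could start on the first sentence while later ones are still being generated. The function name `sentences_from_tokens` and the boundary rule (sentence-ending punctuation followed by whitespace) are assumptions for illustration; a real system would need to handle abbreviations such as "e.g." more carefully.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ., ! or ? followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is followed by a boundary, so it is complete.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing; keep it in the buffer.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
    for sentence in sentences_from_tokens(stream):
        # In a real pipeline, each sentence would be handed to the TTS engine here.
        print(sentence)
```

Because the generator yields a sentence only once the following whitespace arrives, the first audio chunk can be synthesized after one sentence rather than after the whole reply, which is the latency benefit the repo describes.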

The hardware bar is lower than many people expect from real-time multimodal AI. The README lists Apple Silicon macOS or Linux with a supported GPU, Python 3.12+, and roughly 3 GB of free RAM. On first run the app downloads about 2.6 GB for Gemma 4 E2B plus the TTS models. The author describes the project as a research preview and notes that a few months earlier similar real-time behavior would have required a far larger GPU budget.

Why it matters

Parlor is interesting because it packages several UX behaviors people usually associate with hosted assistants into a local stack that developers can inspect and run themselves. The published benchmark on an Apple M3 Pro reports about 1.8 to 2.2 seconds for speech-and-vision understanding, about 0.3 seconds for a short text response, and roughly 0.3 to 0.7 seconds for speech synthesis, with total end-to-end latency around 2.5 to 3.0 seconds.

  • Understanding model: Gemma 4 E2B via LiteRT-LM.
  • Speech model: Kokoro, with MLX on Mac and ONNX on Linux.
  • Claimed decode speed on Apple M3 Pro: roughly 83 tokens per second.
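As a rough sanity check on the reported figures, the sketch below sums the per-stage latency ranges and converts the short-response time into an approximate token count. It assumes the three stages run strictly one after another and that the 0.3-second text response is pure decode time at the claimed rate; neither assumption is confirmed by the source, and the names used are illustrative only.

```python
# Per-stage latency ranges in seconds, as reported for an Apple M3 Pro.
STAGES = {
    "understanding": (1.8, 2.2),
    "text_response": (0.3, 0.3),
    "speech_synthesis": (0.3, 0.7),
}
DECODE_TOKENS_PER_SECOND = 83  # claimed decode speed


def stage_sum(stages):
    """Sum the low and high ends of each stage, assuming no overlap between stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return round(low, 2), round(high, 2)


def tokens_in(seconds, tokens_per_second=DECODE_TOKENS_PER_SECOND):
    """Approximate tokens decoded in a given time at a constant rate."""
    return round(seconds * tokens_per_second, 1)


if __name__ == "__main__":
    low, high = stage_sum(STAGES)
    print(f"Summed stages: {low} to {high} s")
    print(f"Tokens in 0.3 s: about {tokens_in(0.3)}")
```

Under these assumptions the stages add up to 2.4 to 3.2 seconds, which brackets the reported end-to-end range of 2.5 to 3.0 seconds, and a 0.3-second reply at 83 tokens per second corresponds to roughly 25 tokens, or about one short sentence.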

The larger takeaway is that local multimodal voice interfaces are moving from demo territory into reproducible developer projects. Parlor is still early, but it is a concrete example of how quickly laptop-scale AI stacks are improving.
