Parlor Shows Real-Time On-Device Multimodal Voice AI on Apple Silicon

A recent Show HN thread pointed to Parlor, an on-device multimodal assistant that takes microphone audio and camera frames in the browser and replies with synthesized speech, all without a cloud API in the loop. The project uses Gemma 4 E2B for speech-and-vision understanding and Kokoro for text-to-speech.

The architecture is straightforward and practical. PCM audio and JPEG frames stream over a WebSocket to a FastAPI server. Gemma runs via LiteRT-LM on the GPU, Kokoro handles speech generation, and the browser plays back the streamed audio while showing the transcript. The repo also calls out browser-side voice activity detection, barge-in so the user can interrupt mid-sentence, and sentence-level TTS streaming so playback can start before the full answer is complete.
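To make the sentence-level streaming idea concrete, here is a minimal Python sketch of how streamed text could be cut into sentences and handed to a speech engine one at a time. It is not taken from the Parlor repo: the function name `sentences_from_stream` and the simple punctuation-based splitting rule are assumptions made for illustration, and a real pipeline would likely need to handle abbreviations and decimals more carefully.

```python
import re
from typing import Iterable, Iterator

# Assumed rule: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a text stream.

    `chunks` stands in for the pieces of text a language model emits
    while it is still generating (hypothetical input for this sketch).
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hello the", "re. How are", " you? Fine"]
    for sentence in sentences_from_stream(stream):
        # In a real system each sentence would go to the TTS engine here.
        print(sentence)
```

Because each sentence is yielded as soon as its closing punctuation arrives, a speech engine can begin synthesizing the first sentence while later ones are still being generated, which is the effect the repo describes.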

The hardware bar is lower than many people expect from real-time multimodal AI. The README lists Apple Silicon macOS or Linux with a supported GPU, Python 3.12+, and roughly 3 GB of free RAM. On first run the app downloads about 2.6 GB for Gemma 4 E2B plus the TTS models. The author describes the project as a research preview and notes that a few months earlier similar real-time behavior would have required a far larger GPU budget.

Why it matters

Parlor is interesting because it packages several UX behaviors people usually associate with hosted assistants into a local stack that developers can inspect and run themselves. The published benchmark on an Apple M3 Pro reports about 1.8 to 2.2 seconds for speech-and-vision understanding, about 0.3 seconds for a short text response, and roughly 0.3 to 0.7 seconds for speech synthesis, with total end-to-end latency around 2.5 to 3.0 seconds.
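As a quick sanity check on those figures, the short Python sketch below adds up the per-stage ranges. The stage labels are chosen here for illustration, and the sketch assumes the stages run strictly one after another, which the source does not state.

```python
# Per-stage latency ranges in seconds, as reported for an Apple M3 Pro.
# The dictionary keys are illustrative labels, not names from the project.
STAGES = {
    "understanding": (1.8, 2.2),
    "response": (0.3, 0.3),
    "synthesis": (0.3, 0.7),
}


def total_range(stages):
    """Sum the low and high ends of each stage, assuming no overlap."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    # Round to hide floating-point noise in the sums.
    return round(low, 2), round(high, 2)


if __name__ == "__main__":
    low, high = total_range(STAGES)
    print(f"naive sequential sum: {low} to {high} s")
```

The naive sum comes out at 2.4 to 3.2 seconds, which brackets the reported end-to-end figure of roughly 2.5 to 3.0 seconds.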

Understanding model: Gemma 4 E2B via LiteRT-LM.
Speech model: Kokoro, with MLX on Mac and ONNX on Linux.
Claimed decode speed on Apple M3 Pro: roughly 83 tokens per second.
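
The decode figure can be related to the response time with simple arithmetic. The Python sketch below uses hypothetical helper names and assumes a constant decode rate with no prompt-processing or startup overhead.

```python
DECODE_TOKENS_PER_SECOND = 83.0  # claimed rate on an Apple M3 Pro


def tokens_generated(seconds: float, rate: float = DECODE_TOKENS_PER_SECOND) -> float:
    """Tokens produced in `seconds` at a constant decode rate."""
    return rate * seconds


def seconds_needed(tokens: int, rate: float = DECODE_TOKENS_PER_SECOND) -> float:
    """Time to decode `tokens` at a constant decode rate."""
    return tokens / rate


if __name__ == "__main__":
    # A 0.3 second response window corresponds to about 25 tokens.
    print(round(tokens_generated(0.3)))
    # Decode time for a longer, 100-token answer under the same assumption.
    print(round(seconds_needed(100), 2))
```

Under that assumption, a 0.3 second response corresponds to about 25 tokens, in line with the article's description of it as a short text response.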

The larger takeaway is that local multimodal voice interfaces are moving from demo territory into reproducible developer projects. Parlor is still early, but it is a concrete example of how quickly laptop-scale AI stacks are improving.
