Parlor Shows Real-Time On-Device Multimodal Voice AI on Apple Silicon
Original: Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
A recent Show HN thread pointed to Parlor, an on-device multimodal assistant that takes microphone audio and camera frames in the browser and replies with synthesized speech, with no cloud API in the loop. The project uses Gemma 3n E2B for speech-and-vision understanding and Kokoro for text-to-speech.
The architecture is straightforward and practical. PCM audio and JPEG frames stream over a WebSocket to a FastAPI server. Gemma runs via LiteRT-LM on the GPU, Kokoro handles speech generation, and the browser plays back streamed audio while showing the transcript. The repo also calls out browser-side voice activity detection, barge-in so the user can interrupt mid-sentence, and sentence-level TTS streaming so playback can start before the full answer is complete.
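To make the pipeline concrete, here is a minimal sketch of the kind of binary framing such a server needs. The tag values, function name, and message layout are assumptions for illustration, not Parlor's actual wire format: each binary WebSocket message carries a one-byte tag saying whether the payload is PCM audio or a JPEG camera frame.

```python
# Hypothetical 1-byte message tags (illustrative, not Parlor's protocol).
AUDIO_TAG, FRAME_TAG = 0x01, 0x02

def handle_message(msg: bytes, pcm_buffer: bytearray, state: dict) -> None:
    """Route one binary WebSocket message by its leading tag byte."""
    tag, payload = msg[0], msg[1:]
    if tag == AUDIO_TAG:
        # Accumulate audio until VAD signals end-of-utterance.
        pcm_buffer.extend(payload)
    elif tag == FRAME_TAG:
        # Keep only the newest camera frame as vision input for the next model call.
        state["latest_frame"] = payload
```

In a FastAPI endpoint this would sit inside a loop over `await ws.receive_bytes()`, with the server streaming synthesized audio back via `send_bytes` once the model responds.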
The hardware bar is lower than many people expect from real-time multimodal AI. The README lists Apple Silicon macOS or Linux with a supported GPU, Python 3.12+, and roughly 3 GB of free RAM. On first run the app downloads about 2.6 GB for Gemma 3n E2B plus the TTS models. The author describes the project as a research preview and notes that a few months earlier similar real-time behavior would have required a far larger GPU budget.
Why it matters
Parlor is interesting because it packages several UX behaviors people usually associate with hosted assistants into a local stack that developers can inspect and run themselves. The published benchmark on an Apple M3 Pro reports about 1.8 to 2.2 seconds for speech-and-vision understanding, about 0.3 seconds for a short text response, and roughly 0.3 to 0.7 seconds for speech synthesis, with total end-to-end latency around 2.5 to 3.0 seconds.
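Much of the perceived latency comes down to the sentence-level TTS streaming mentioned above: rather than waiting for the full reply, the server can dispatch each sentence to the speech model as soon as it is complete. A sketch of that idea, assuming a simple punctuation-based splitter (Parlor's actual splitter is not published here):

```python
import re

def sentences(token_stream):
    """Yield complete sentences as soon as they appear in an incremental
    token stream, so TTS can start before the full reply is generated."""
    buf = ""
    for tok in token_stream:
        buf += tok
        # Flush on sentence-ending punctuation followed by whitespace.
        while (m := re.search(r"[.!?](\s+)", buf)):
            yield buf[:m.start() + 1].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush any trailing partial sentence
```

Feeding this generator from the LLM's token stream and handing each yielded sentence to the TTS engine is what lets playback begin well before decoding finishes.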
- Understanding model: Gemma 3n E2B via LiteRT-LM.
- Speech model: Kokoro, with MLX on Mac and ONNX on Linux.
- Claimed decode speed on Apple M3 Pro: roughly 83 tokens per second.
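The claimed decode speed lines up with the reported response time. As a quick sanity check (the 25-token reply length is an illustrative assumption, not a published figure):

```python
# At the claimed ~83 tokens/s decode speed, the reported ~0.3 s short text
# response corresponds to roughly 25 tokens of output.
decode_tps = 83          # claimed decode speed on Apple M3 Pro
reply_tokens = 25        # assumed length of a short spoken reply
decode_seconds = reply_tokens / decode_tps
print(f"{decode_seconds:.2f} s")  # ≈ 0.30 s, matching the reported figure
```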
The larger takeaway is that local multimodal voice interfaces are moving from demo territory into reproducible developer projects. Parlor is still early, but it is a concrete example of how quickly laptop-scale AI stacks are improving.