Reddit showcases Parlor, a real-time local voice-and-vision assistant powered by Gemma 4 E2B
Original: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
A LocalLLaMA demo post highlighted Parlor, an open-source project that pushes a full voice-and-vision conversation loop onto the local machine. The setup uses Gemma 4 E2B for multimodal understanding and Kokoro for text-to-speech, letting users talk to the system, show it camera input, and hear responses back without relying on a remote inference API.
The repository describes a straightforward but practical pipeline. The browser captures microphone audio and camera frames, then streams PCM audio and JPEG images over WebSocket to a FastAPI server. On the backend, LiteRT-LM runs Gemma 4 E2B on GPU for speech-and-vision understanding, while Kokoro handles speech synthesis. The frontend also includes browser-side Voice Activity Detection, barge-in support so the user can interrupt the assistant mid-response, and sentence-level TTS streaming so playback starts before the full answer is finished.
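The sentence-level TTS streaming idea can be illustrated with a short sketch. This is not Parlor's code: the function name `stream_sentences` and the boundary rule (a `.`, `!`, or `?` followed by whitespace) are assumptions chosen for illustration. It shows how generated text might be cut into sentences as tokens arrive, so that each sentence can be handed to a speech synthesizer before the full answer exists.

```python
import re
from typing import Iterable, Iterator

# Hypothetical boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush the remainder once generation stops.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    generated = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
    for sentence in stream_sentences(generated):
        # In a real pipeline this is where a TTS call would go.
        print(sentence)
```

A sentence is only emitted once the token after its punctuation arrives, which costs one token of delay but avoids cutting a sentence early. A rule this simple also splits on abbreviations such as "e.g.", so a real system would likely need a more careful segmenter.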
Published numbers matter here
What makes the project notable is that it comes with concrete performance claims instead of just a demo clip. On an Apple M3 Pro, the README reports roughly 1.8-2.2 seconds for speech and vision understanding, around 0.3 seconds to generate a response of about 25 tokens, and another 0.3-0.7 seconds for text-to-speech. The stage figures sum to 2.4-3.2 seconds, consistent with the reported total of about 2.5-3.0 seconds end to end, and the decode speed is near 83 tokens per second (25 tokens at that rate take about 0.3 seconds). Hardware requirements are also modest by current standards: Python 3.12+, Apple Silicon or a supported Linux GPU, and roughly 3 GB of free RAM for the model.
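The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above; the names `STAGES`, `total_range`, and `decode_seconds` are made up for illustration and do not come from the project.

```python
# Per-stage latency ranges in seconds, as quoted from the README figures above.
STAGES = {
    "speech and vision understanding": (1.8, 2.2),
    "response generation (~25 tokens)": (0.3, 0.3),
    "text-to-speech": (0.3, 0.7),
}


def total_range(stages: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the lower and upper bounds of every stage."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return round(low, 2), round(high, 2)


def decode_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to decode a response at a given throughput."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    low, high = total_range(STAGES)
    print(f"end-to-end: {low}-{high} s")
    print(f"25 tokens at 83 tok/s: {decode_seconds(25, 83):.2f} s")
```

The summed range of 2.4-3.2 seconds brackets the quoted 2.5-3.0 second total, and 25 tokens at 83 tokens per second works out to about 0.3 seconds, so the published numbers are internally consistent.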
The maintainer is careful to label Parlor a “research preview,” which is the right caveat. This is not presented as a polished consumer assistant, and rough edges are expected. Even so, the project points to a meaningful shift. Multimodal interaction that recently felt tied to expensive cloud inference is now reaching the range where local devices can handle it for narrow but useful tasks.
A stronger case for small multimodal models
The author frames language learning as a compelling use case, and that seems right. Fast local turn-taking, camera grounding, and multilingual fallback are exactly the kinds of features that benefit from lower latency and lower operating cost. More broadly, the LocalLLaMA reaction shows why developers care about projects like this: they make “edge AI” feel less like a slogan and more like a buildable product category.