Hacker News Pushes an On-Device Voice AI Stack for Apple Silicon

Original: Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

LLM · Mar 11, 2026 · By Insights AI (HN) · 2 min read

Why the Launch HN post is landing

YC W26 founders Sanchit and Shubham used Launch HN to introduce MetalRT and RCLI as one coherent story: a local voice AI stack for macOS that handles STT, LLM inference, TTS, and document retrieval without routing requests through cloud APIs. The RCLI repository frames the product in exactly those terms, promising a way to talk to your Mac, query local documents, and keep the full pipeline on-device. That makes the launch more interesting than a narrow benchmark drop, because it ties raw inference speed to an end-user workflow.

The HN post is concrete about the performance claim. It says users can reproduce the numbers with rcli bench on an M4 Max with 64 GB RAM, and lists 658 tok/s for Qwen3-0.6B, 186 tok/s for Qwen3-4B, and 570 tok/s for LFM2.5-1.2B. The same launch also reports a 6.6 ms time-to-first-token, 101 ms to transcribe 70 seconds of audio, and 178 ms for TTS synthesis. Those figures are presented against llama.cpp, Apple MLX, and sherpa-onnx, which is part of why the thread drew attention: the post is arguing not only that local AI is viable, but that a tightly optimized Apple-Silicon-first stack can beat widely used open alternatives on multiple stages of the pipeline.
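The throughput figures are easier to reason about as per-token latency. A quick back-of-envelope conversion (the tok/s numbers are from the Launch HN post; the steady-state-decode assumption and the script itself are ours):

```python
# Convert the reported decode throughput into milliseconds per token.
# Assumes steady-state generation (no prefill cost); figures are the
# ones quoted in the Launch HN post for an M4 Max with 64 GB RAM.
reported_tok_per_s = {
    "Qwen3-0.6B": 658,
    "Qwen3-4B": 186,
    "LFM2.5-1.2B": 570,
}

for model, tps in reported_tok_per_s.items():
    ms_per_token = 1000 / tps
    print(f"{model}: {tps} tok/s ~= {ms_per_token:.2f} ms/token")
```

At these rates even the 4B model emits a token roughly every 5 ms, which is what makes the 6.6 ms time-to-first-token claim the more interesting number: prompt processing, not decode, is usually where local stacks stall.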

What the product actually offers

  • RCLI exposes 38 macOS actions through voice or text commands.
  • The README advertises local RAG with roughly 4 ms retrieval over 5K+ chunks.
  • MetalRT is positioned as a single GPU runtime for LLM, STT, and TTS on Apple Silicon.
  • The install path falls back to llama.cpp on M1/M2, reserving the fastest MetalRT path for M3 and newer chips.
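The "roughly 4 ms retrieval over 5K+ chunks" claim is plausible on its face, because at that corpus size brute-force vector search is just one matrix-vector product. A minimal sketch of that idea (the dimensions, data, and scoring here are invented for illustration; RCLI's actual index and embedding model are not documented in the post):

```python
import numpy as np

# Brute-force cosine-similarity retrieval over ~5K chunk embeddings.
# All values are synthetic; this only illustrates why small-corpus
# retrieval can run in single-digit milliseconds on modern hardware.
rng = np.random.default_rng(0)
n_chunks, dim = 5000, 384

# Unit-normalize so a dot product equals cosine similarity.
chunks = rng.standard_normal((n_chunks, dim)).astype(np.float32)
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)

query = rng.standard_normal(dim).astype(np.float32)
query /= np.linalg.norm(query)

scores = chunks @ query                  # one matvec over the whole corpus
top5 = np.argsort(scores)[-5:][::-1]     # indices of the 5 best chunks
print(top5, scores[top5])
```

A 5000 × 384 float32 matvec is on the order of a few million FLOPs, so no approximate index is needed at this scale; the interesting engineering is in keeping the embeddings resident in memory alongside the models.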

That packaging matters because voice AI is usually where local inference pipelines fall apart. STT, LLM generation, and TTS are sequential, so latency compounds fast. The Launch HN write-up says the team attacked that by using custom Metal shaders, pre-allocated memory, and one unified engine instead of stitching together separate runtimes. The public repository reinforces the same point with sub-200ms end-to-end latency claims, hot-swappable models, local actions, and a terminal UI that surfaces timing data instead of hiding it.
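The compounding point can be made concrete by simply adding up the per-stage figures the post reports. The additive model below is our simplification: it ignores streaming and overlap between stages, and the STT figure covers a full 70 seconds of audio, far longer than a typical voice turn, so it is an upper bound rather than a per-turn measurement:

```python
# Naive sequential-latency model for a voice pipeline, using the
# per-stage figures quoted in the Launch HN post. The additive model
# (no streaming, no stage overlap) is an assumption for illustration.
stages_ms = {
    "STT (70 s of audio)": 101.0,
    "LLM time-to-first-token": 6.6,
    "TTS synthesis": 178.0,
}

for name, ms in stages_ms.items():
    print(f"{name}: {ms} ms")

total = sum(stages_ms.values())
print(f"naive sequential total: {total:.1f} ms")
```

Even in this worst-case additive reading, the budget is dominated by STT and TTS rather than the LLM, which is why a unified runtime that removes per-stage handoff overhead is the lever the team pulled.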

There are still important tradeoffs. The open-source layer is RCLI under MIT, but MetalRT itself is distributed under a proprietary license, and the fastest route is clearly tuned for newer Apple Silicon hardware. That is exactly why the thread is interesting: it is less a generic AI launch than a live argument that vertical optimization, privacy, and local control can justify a more opinionated stack than the usual cloud-first tooling.

Source: RunAnywhereAI/RCLI. Community discussion: Hacker News thread.

Related Articles

LLM Reddit 15h ago 2 min read

A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.

LLM Twitter 1d ago 2 min read

NVIDIA AI Developer introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context window. NVIDIA says the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super model.


© 2026 Insights. All rights reserved.