Hacker News Pushes an On-Device Voice AI Stack for Apple Silicon

Original: Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

LLM · Mar 11, 2026 · By Insights AI (HN) · 2 min read

Why the Launch HN post is landing

YC W26 founders Sanchit and Shubham used Launch HN to introduce MetalRT and RCLI as one coherent story: a local voice AI stack for macOS that handles STT, LLM inference, TTS, and document retrieval without routing requests through cloud APIs. The RCLI repository frames the product in exactly those terms, promising a way to talk to your Mac, query local documents, and keep the full pipeline on-device. That makes the launch more interesting than a narrow benchmark drop, because it ties raw inference speed to an end-user workflow.

The HN post is concrete about the performance claim. It says users can reproduce the numbers with rcli bench on an M4 Max with 64 GB RAM, and lists 658 tok/s for Qwen3-0.6B, 186 tok/s for Qwen3-4B, and 570 tok/s for LFM2.5-1.2B. The same launch also reports a 6.6 ms time-to-first-token, 101 ms to transcribe 70 seconds of audio, and 178 ms for TTS synthesis. Those figures are presented against llama.cpp, Apple MLX, and sherpa-onnx, which is part of why the thread drew attention: the post is arguing not only that local AI is viable, but that a tightly optimized Apple-Silicon-first stack can beat widely used open alternatives on multiple stages of the pipeline.
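The throughput figures are easier to reason about as per-token latency. A quick back-of-envelope conversion (the tok/s numbers are from the Launch HN post; the steady-state-decode assumption and the script itself are ours):

```python
# Convert the reported decode throughput into milliseconds per token.
# Assumes steady-state generation (no prefill cost); figures are the
# ones quoted in the Launch HN post for an M4 Max with 64 GB RAM.
reported_tok_per_s = {
    "Qwen3-0.6B": 658,
    "Qwen3-4B": 186,
    "LFM2.5-1.2B": 570,
}

for model, tps in reported_tok_per_s.items():
    ms_per_token = 1000 / tps
    print(f"{model}: {tps} tok/s ~= {ms_per_token:.2f} ms/token")
```

At these rates even the 4B model emits a token roughly every 5 ms, which is what makes the 6.6 ms time-to-first-token claim the more interesting number: prompt processing, not decode, is usually where local stacks stall.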

What the product actually offers

  • RCLI exposes 38 macOS actions through voice or text commands.
  • The README advertises local RAG with roughly 4 ms retrieval over 5K+ chunks.
  • MetalRT is positioned as a single GPU runtime for LLM, STT, and TTS on Apple Silicon.
  • The install path falls back to llama.cpp on M1/M2, reserving the fastest MetalRT path for M3 and newer chips.
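The "roughly 4 ms retrieval over 5K+ chunks" claim is plausible on its face, because at that corpus size brute-force vector search is just one matrix-vector product. A minimal sketch of that idea (the dimensions, data, and scoring here are invented for illustration; RCLI's actual index and embedding model are not documented in the post):

```python
import numpy as np

# Brute-force cosine-similarity retrieval over ~5K chunk embeddings.
# All values are synthetic; this only illustrates why small-corpus
# retrieval can run in single-digit milliseconds on modern hardware.
rng = np.random.default_rng(0)
n_chunks, dim = 5000, 384

# Unit-normalize so a dot product equals cosine similarity.
chunks = rng.standard_normal((n_chunks, dim)).astype(np.float32)
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)

query = rng.standard_normal(dim).astype(np.float32)
query /= np.linalg.norm(query)

scores = chunks @ query                  # one matvec over the whole corpus
top5 = np.argsort(scores)[-5:][::-1]     # indices of the 5 best chunks
print(top5, scores[top5])
```

A 5000 × 384 float32 matvec is on the order of a few million FLOPs, so no approximate index is needed at this scale; the interesting engineering is in keeping the embeddings resident in memory alongside the models.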

That packaging matters because voice AI is usually where local inference pipelines fall apart. STT, LLM generation, and TTS are sequential, so latency compounds fast. The Launch HN write-up says the team attacked that by using custom Metal shaders, pre-allocated memory, and one unified engine instead of stitching together separate runtimes. The public repository reinforces the same point with sub-200ms end-to-end latency claims, hot-swappable models, local actions, and a terminal UI that surfaces timing data instead of hiding it.
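The compounding point can be made concrete by simply adding up the per-stage figures the post reports. The additive model below is our simplification: it ignores streaming and overlap between stages, and the STT figure covers a full 70 seconds of audio, far longer than a typical voice turn, so it is an upper bound rather than a per-turn measurement:

```python
# Naive sequential-latency model for a voice pipeline, using the
# per-stage figures quoted in the Launch HN post. The additive model
# (no streaming, no stage overlap) is an assumption for illustration.
stages_ms = {
    "STT (70 s of audio)": 101.0,
    "LLM time-to-first-token": 6.6,
    "TTS synthesis": 178.0,
}

for name, ms in stages_ms.items():
    print(f"{name}: {ms} ms")

total = sum(stages_ms.values())
print(f"naive sequential total: {total:.1f} ms")
```

Even in this worst-case additive reading, the budget is dominated by STT and TTS rather than the LLM, which is why a unified runtime that removes per-stage handoff overhead is the lever the team pulled.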

There are still important tradeoffs. The open-source layer is RCLI under MIT, but MetalRT itself is distributed under a proprietary license, and the fastest route is clearly tuned for newer Apple Silicon hardware. That is exactly why the thread is interesting: it is less a generic AI launch than a live argument that vertical optimization, privacy, and local control can justify a more opinionated stack than the usual cloud-first tooling.

Source: RunAnywhereAI/RCLI. Community discussion: Hacker News thread.

Related Articles

LLM Reddit 15h ago 2 min read

A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.

LLM Twitter 1d ago 2 min read

NVIDIA AI Developer introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context window. NVIDIA says the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super model.


© 2026 Insights. All rights reserved.