Running Nvidia PersonaPlex 7B in Swift on Apple Silicon moves local voice agents closer to real time
Original: Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift
Hacker News discussion: https://news.ycombinator.com/item?id=47258801
Primary source: Ivan Campos on PersonaPlex 7B
This HN post points to a practical experiment that feels more useful than another benchmark chart: getting NVIDIA Research’s PersonaPlex 7B speech-to-speech stack running natively in Swift on Apple Silicon through MLX. The model is interesting on its own because PersonaPlex is designed for full-duplex voice interaction rather than a simple speech-to-text plus text-to-speech chain. The blog post is even more interesting because it focuses on what had to change to make that setup usable on a real local machine.
What the port changed
- A heartbeat chunking system emits audio every 0.5 seconds to reduce dead air and repetitive sentence endings.
- Passthrough mode and realtime preview let the assistant voice appear immediately and support interruption.
- The audio stack was rebuilt around a ring buffer, dynamic chunk dropping, and session management for multi-user or multi-agent use.
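The ring-buffer-with-dropping idea in the list above can be sketched in a few lines. This is a minimal illustration under my own assumptions (mono Float32 samples, drop-oldest under pressure); the type and property names are hypothetical, not the port's actual API:

```swift
// Drop-oldest audio ring buffer: when the consumer falls behind, the
// oldest samples are overwritten so playback stays near real time
// instead of drifting further and further behind the model.
struct AudioRingBuffer {
    private var storage: [Float]
    private var head = 0       // next write index
    private var count = 0      // valid samples currently buffered
    var droppedSamples = 0     // audio discarded under backpressure

    init(capacity: Int) {
        storage = [Float](repeating: 0, count: capacity)
    }

    // Write a chunk; a full buffer silently overwrites the oldest data.
    mutating func write(_ chunk: [Float]) {
        for sample in chunk {
            storage[head] = sample
            head = (head + 1) % storage.count
            if count < storage.count {
                count += 1
            } else {
                droppedSamples += 1  // oldest sample was overwritten
            }
        }
    }

    // Read up to `n` of the oldest buffered samples (playback side).
    mutating func read(_ n: Int) -> [Float] {
        let toRead = min(n, count)
        var out = [Float]()
        out.reserveCapacity(toRead)
        var tail = (head - count + storage.count) % storage.count
        for _ in 0..<toRead {
            out.append(storage[tail])
            tail = (tail + 1) % storage.count
        }
        count -= toRead
        return out
    }
}
```

A heartbeat timer firing every 0.5 seconds would then `read` from this buffer and hand the samples to the output device, which is what keeps dead air bounded even when generation is bursty.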
The author says the early port was stuck at 3 to 4 seconds of latency, too slow for natural turn-taking. After the pipeline changes, the M4 Pro demo reaches about 1.3x real-time output from the 1.5B encoder stage and roughly a 0.4 latency factor between the end of user speech and the start of assistant speech. The remaining weak point is turn detection: the voice activity logic still needs tuning outside the original Python reference environment.
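As a quick sanity check on the 1.3x figure: a real-time factor is just seconds of audio produced per wall-clock second of compute, with values above 1.0 meaning faster than real time. A trivial helper (the function name is mine, not the port's) makes the arithmetic explicit:

```swift
// Real-time factor: audio seconds generated per wall-clock second.
// RTF > 1.0 means the pipeline outruns playback; RTF < 1.0 means
// the listener eventually hears gaps or the buffer must drop audio.
func realTimeFactor(audioSeconds: Double, wallClockSeconds: Double) -> Double {
    audioSeconds / wallClockSeconds
}

// E.g. 13 seconds of speech generated in 10 seconds of compute is 1.3x.
let rtf = realTimeFactor(audioSeconds: 13.0, wallClockSeconds: 10.0)
```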
The reason this matters for AI builders is simple. Local voice agents are finally moving from “works in a demo” toward “usable in conversation.” The post is a good reminder that the hard part is not only model quality. Streaming, buffering, interruption, and the ergonomics of the full audio loop are what decide whether speech-to-speech systems feel natural.
Related Articles
A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.
Ollama used a March 30, 2026 preview to move its Apple Silicon path onto MLX. The release pairs higher prefill and decode throughput with NVFP4 support and cache changes aimed at coding and agent workflows.
A March 31, 2026 Hacker News hit brought attention to Ollama’s new MLX-based Apple Silicon runtime. The announcement combines MLX, NVFP4, and upgraded cache behavior to make local coding-agent workloads on macOS more practical.