A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.
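For context on where a headline ratio like 4.6x comes from, here is a generic per-group KV-cache quantization sketch in numpy. This is not TurboQuant's actual algorithm; the bit width, group size, and symmetric scheme are illustrative assumptions, and the point is only to show how per-group scale overhead eats into the nominal FP16-to-4-bit ratio.

```python
import numpy as np

def quantize_kv_groups(x, bits=4, group=64):
    """Symmetric per-group quantization of a flat KV-cache slice.

    Generic illustration only, not TurboQuant's scheme: each group of `group`
    values shares one fp16 scale, so storage is `bits` per value plus 16/group.
    """
    x = x.reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax      # one scale per group
    q = np.clip(np.round(x / np.maximum(scale, 1e-8)), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

k = np.random.randn(4 * 64).astype(np.float32)
q, scale = quantize_kv_groups(k)
bits_per_value = 4 + 16 / 64                      # payload + amortized scale overhead
print(f"compression vs FP16: {16 / bits_per_value:.2f}x")    # ~3.76x for this layout
```

Getting past that naive ratio is exactly where algorithm and integration details (how scales are stored, which tensors are quantized, what the kernels fuse) start to matter.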
#apple-silicon
A LocalLLaMA self-post shared an open-source TurboQuant implementation for llama.cpp that skips value dequantization when attention weights are negligible. The author reports a 22.8% decode gain at 32K context on Qwen3.5-35B-A3B on an Apple M5 Max, with unchanged perplexity and better needle-in-a-haystack retrieval.
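The idea is easy to sketch in numpy, even though the real work is a Metal kernel: if a token's softmax weight is effectively zero, its value vector never needs to be dequantized at all. The threshold `eps` and the per-row int8 layout below are assumptions, not the post's actual parameters.

```python
import numpy as np

def sparse_value_attention(attn_w, v_q, v_scale, eps=1e-4):
    """Attention-weighted sum over a quantized V cache, skipping rows whose
    softmax weight is below `eps` so they are never dequantized.

    attn_w : (T,) softmax weights for one query head
    v_q    : (T, D) int8-quantized value vectors
    v_scale: (T, 1) per-row dequantization scales
    """
    keep = attn_w > eps                        # negligible weights contribute ~nothing
    # Dequantize only the rows that actually matter (saves bandwidth at long context)
    v_kept = v_q[keep].astype(np.float32) * v_scale[keep]
    return attn_w[keep] @ v_kept               # (D,) output, close to the full sum
```

At 32K context most attention mass concentrates on a small fraction of tokens, which is why the skipped work translates into a measurable decode speedup without moving perplexity.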
Hacker News noticed Hypura because it treats Apple Silicon memory limits as a scheduling problem, spreading tensors across GPU, RAM, and NVMe instead of letting oversized models crash.
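A minimal greedy version of that scheduling idea fits in a few lines. This is not Hypura's scheduler; the per-tier byte budgets, the access-frequency heuristic, and the tensor list are all made-up placeholders meant only to show what "placement as scheduling" looks like.

```python
# Toy tiered tensor placement: assign the hottest tensors to the fastest tier with room.
TIERS = [("gpu", 10 * 2**30), ("ram", 64 * 2**30), ("nvme", 512 * 2**30)]  # assumed budgets

def place(tensors):
    """tensors: list of (name, size_bytes, access_frequency) tuples."""
    remaining = dict(TIERS)
    placement = {}
    for name, size, _freq in sorted(tensors, key=lambda t: t[2], reverse=True):
        for tier, _cap in TIERS:
            if remaining[tier] >= size:
                placement[name] = tier
                remaining[tier] -= size
                break
        else:
            raise MemoryError(f"{name} does not fit in any tier")
    return placement

print(place([("embed", 2 * 2**30, 0.9), ("layer0.ffn", 8 * 2**30, 0.5),
             ("layer40.ffn", 8 * 2**30, 0.1)]))
# {'embed': 'gpu', 'layer0.ffn': 'gpu', 'layer40.ffn': 'ram'}
```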
A Hacker News discussion highlighted Flash-MoE, a pure C/Metal inference stack that streams Qwen3.5-397B-A17B from SSD and reaches interactive speeds on a 48GB M3 Max laptop.
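Streaming a sparse MoE model from SSD works because only a handful of experts are active per token, so most weights can stay on disk until the router asks for them. The toy Python cache below illustrates the pattern (on-demand memory-mapped loads plus a small LRU of resident experts); it is not Flash-MoE's C/Metal code, and the flat fp16 file layout and cache size are assumptions.

```python
from collections import OrderedDict
import numpy as np

class ExpertCache:
    """Toy LRU cache of MoE expert weights streamed from disk on demand."""

    def __init__(self, path, expert_shape, max_resident=8):
        self.path, self.shape, self.max_resident = path, expert_shape, max_resident
        self.cache = OrderedDict()

    def get(self, expert_id):
        if expert_id in self.cache:               # hit: refresh LRU order
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Miss: map only this expert's slice instead of loading the whole model
        n = int(np.prod(self.shape))
        w = np.memmap(self.path, dtype=np.float16, mode="r",
                      offset=expert_id * n * 2, shape=self.shape)
        if len(self.cache) >= self.max_resident:  # evict least recently used expert
            self.cache.popitem(last=False)
        self.cache[expert_id] = w
        return w
```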
A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains on prompt processing rather than raw generation alone. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.
A project post in r/MachineLearning points to mlx-tune, a library that wraps Apple’s MLX stack in an Unsloth-compatible training API for SFT, DPO, GRPO, LoRA, and vision-language fine-tuning on Apple Silicon Macs.
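For readers unsure what a LoRA adapter actually changes during fine-tuning, here is the core update in plain numpy. This is generic LoRA math, not mlx-tune's or MLX's API: the base weight stays frozen and only the two low-rank factors are trained.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, rank=8):
    """y = x @ (W + (alpha / rank) * A @ B); only A and B receive gradients.

    W : (d_in, d_out) frozen base weight
    A : (d_in, rank) trainable, Gaussian init
    B : (rank, d_out) trainable, zero init so the update starts at zero
    """
    return x @ W + (alpha / rank) * (x @ A) @ B

d_in, d_out, rank = 512, 512, 8
W = np.random.randn(d_in, d_out).astype(np.float32)
A = np.random.randn(d_in, rank).astype(np.float32) * 0.01
B = np.zeros((rank, d_out), dtype=np.float32)
x = np.random.randn(4, d_in).astype(np.float32)
print(lora_forward(x, W, A, B).shape)   # (4, 512)
```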
A fast-rising r/LocalLLaMA thread says the community has already submitted nearly 10,000 Apple Silicon benchmark runs across more than 400 models. The post matters because it replaces scattered anecdotes with a shared dataset that begins to show consistent throughput patterns across M-series chips and context lengths.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.
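A quick back-of-the-envelope calculation shows why prefill can dominate. The throughput figures below are made-up placeholders, not numbers from the thread: an engine that decodes 30% faster but prefills 20% slower still loses end to end once the prompt is long enough.

```python
def total_latency(prompt_tokens, gen_tokens, prefill_tps, decode_tps):
    """End-to-end latency = prompt prefill time + token-by-token decode time."""
    return prompt_tokens / prefill_tps + gen_tokens / decode_tps

a = total_latency(32_000, 500, prefill_tps=800, decode_tps=70)   # faster prefill
b = total_latency(32_000, 500, prefill_tps=640, decode_tps=91)   # faster decode
print(f"A: {a:.1f}s  B: {b:.1f}s")  # A: 47.1s  B: 55.5s - B's decode win is erased by prefill
```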
An r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
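The kernel covers the per-token recurrent state update of the model's linear-attention layers. The numpy sketch below is a rough, generic gated delta-rule step, not the exact formulation or kernel in the PR; it just shows the kind of per-token work that fusion keeps in registers instead of writing each intermediate (decay, error, rank-1 correction) out as a separate tensor.

```python
import numpy as np

def gdn_step(S, k, v, alpha, beta):
    """One token of a gated delta-rule recurrence (rough illustration only).

    S     : (d_v, d_k) recurrent state; prediction for key k is S @ k
    k, v  : (d_k,), (d_v,) key/value for this token
    alpha : decay gate in (0, 1), beta : write strength in (0, 1)
    """
    S = alpha * S                       # decay the old state
    err = S @ k - v                     # prediction error for this key
    S = S - beta * np.outer(err, k)     # rank-1 correction writes v into k's slot
    return S
```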
A Launch HN thread pulled RunAnywhere's MetalRT and RCLI into focus: an Apple Silicon-first macOS voice AI stack combining low-latency STT, LLM, and TTS with local RAG and 38 system actions, all without relying on cloud APIs.
A high-scoring r/LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.