HN found this interesting because it tests a real boundary: whether Apple Silicon unified memory can make a Wasm sandbox and a GPU buffer operate on the same bytes.
#apple-silicon
HN liked the ambition but went straight for the weak points: marketplace demand, MDM trust, Mac privacy claims, and whether the operator economics are believable. Darkbloom says idle Apple Silicon can serve OpenAI-compatible private inference at lower cost; the thread treated that as an architecture and incentives problem, not just a landing page.
LocalLLaMA paid attention to this post because it looked like real engineering cleanup rather than another inflated speed screenshot. On April 13, 2026, the author reported taking a stock-MLX baseline for Qwen3.5-9B at 2048 tokens from 30.96 tok/s to 127.07 tok/s with 89.36% acceptance, and released the full runtime as open source.
A fresh r/LocalLLaMA post benchmarked DFlash on an M5 Max with MLX 0.31.1 and reported 127.07 tok/s, a 4.13x speedup, on Qwen3.5-9B. The most useful part is not the headline number but the post’s clear reproduction setup and its bandwidth-bound interpretation of the results.
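A quick sanity check on the figures quoted in the two DFlash entries above (only the numbers already cited; nothing here comes from the DFlash code itself):

```python
# Figures quoted in the two DFlash entries above.
baseline_tps = 30.96   # stock MLX, Qwen3.5-9B, 2048 tokens
dflash_tps = 127.07    # DFlash runtime, M5 Max, MLX 0.31.1

speedup = dflash_tps / baseline_tps
print(f"implied speedup: {speedup:.2f}x")  # ~4.10x, close to the 4.13x headline
```

The small gap between the implied ~4.10x and the reported 4.13x presumably reflects slightly different baseline runs.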
A LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.
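A claim like "bit-for-bit identical" is easy to check by comparing the two decoders' greedy token streams directly. The sketch below assumes two hypothetical callables, `generate_baseline` and `generate_dflash`, each mapping a prompt to a list of generated token IDs under greedy decoding; neither name comes from the DFlash post.

```python
def assert_bit_identical(prompts, generate_baseline, generate_dflash, max_tokens=256):
    """Compare greedy token streams from two decoders and report the first divergence.

    generate_baseline / generate_dflash are hypothetical callables:
    (prompt, max_tokens) -> list[int] of generated token IDs.
    """
    for prompt in prompts:
        ref = generate_baseline(prompt, max_tokens)
        out = generate_dflash(prompt, max_tokens)
        for i, (a, b) in enumerate(zip(ref, out)):
            if a != b:
                raise AssertionError(f"divergence at token {i} for {prompt!r}: {a} != {b}")
        if len(ref) != len(out):
            raise AssertionError(f"length mismatch for {prompt!r}: {len(ref)} vs {len(out)}")
    print(f"all {len(prompts)} prompts produced identical greedy token streams")
```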
A recent LocalLLaMA discussion shared results from Mac LLM Bench, an open benchmark workflow for Apple Silicon systems. The most useful takeaway is practical: dense 32B models hit a clear wall on a 32 GB MacBook Air M5, while some MoE models offer a much better latency-to-capability tradeoff.
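Rough arithmetic makes that 32 GB wall plausible (illustrative numbers, not from the post): a dense 32B model needs roughly 15 GiB of weights even at 4 bits per parameter, before KV cache and the OS share of unified memory, while a MoE model whose total weights fit only has to read its active experts per decode step, which is why it fares better on a bandwidth-limited laptop.

```python
def weights_gib(params_b, bits_per_weight):
    """Approximate weight footprint in GiB for a model with params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

# Dense 32B at 4-bit quantization: ~15 GiB of weights alone, leaving little of a
# 32 GiB unified-memory machine for KV cache, the OS, and everything else resident.
print(f"dense 32B @ 4-bit: {weights_gib(32, 4):.1f} GiB")
print(f"dense 32B @ 8-bit: {weights_gib(32, 8):.1f} GiB")  # clearly does not fit
```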
A recent Show HN thread pointed to Parlor, a local multimodal assistant that combines Gemma 4 E2B, Kokoro, browser voice activity detection, and streaming audio playback. The project reports around 2.5 to 3.0 seconds of end-to-end latency on an Apple M3 Pro.
A LocalLLaMA demo pointed to Parlor, which runs speech and vision understanding with Gemma 4 E2B and uses Kokoro for text-to-speech, all on-device. The README reports roughly 2.5-3.0 seconds end-to-end latency and about 83 tokens/sec decode speed on an Apple M3 Pro.
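Taking the README's two figures together gives a rough sense of where the 2.5-3.0 seconds go (the reply length below is an assumption for illustration, not from the README): at 83 tokens/sec, decoding a short spoken reply costs well under a second, so most of the budget sits in prefill, Kokoro synthesis, VAD, and audio transport.

```python
decode_tps = 83.0           # decode speed reported in the Parlor README
reply_tokens = 50           # assumed length of a short spoken reply (illustrative)
e2e_latency_s = (2.5, 3.0)  # end-to-end range reported in the README

decode_s = reply_tokens / decode_tps
print(f"decode time for {reply_tokens} tokens: {decode_s:.2f} s")  # ~0.60 s
print(f"remaining budget: {e2e_latency_s[0] - decode_s:.2f}-"
      f"{e2e_latency_s[1] - decode_s:.2f} s for prefill, TTS, VAD, audio")
```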
A March 31, 2026 Hacker News thread brought attention to Ollama’s new MLX-based Apple Silicon runtime. The announcement combines MLX, NVFP4, and upgraded cache behavior to make local coding-agent workloads on macOS more practical.
A March 30, 2026 r/LocalLLaMA post pointed to an experimental ggml backend that sends matrix work to Apple’s Neural Engine. The prototype is not upstream, but it is one of the clearest signs yet that developers are treating ANE as a serious local inference target.
Ollama used a March 30, 2026 preview to move its Apple Silicon path onto MLX. The release pairs higher prefill and decode throughput with NVFP4 support and cache changes aimed at coding and agent workflows.
A new r/LocalLLaMA benchmark post says an M5 Max system pushed Qwen3.5-397B to 20.34 tok/s through SSD streaming, with I/O parallelism, temporal expert prediction, and Q3-GGUF experts doing most of the work.
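Those numbers only work if most expert reads never touch the SSD, which is what the temporal expert prediction is for. A back-of-the-envelope check, where every parameter below is an assumption for illustration rather than a figure from the post:

```python
def required_ssd_gibps(active_expert_gib_per_token, miss_rate, tokens_per_s):
    """SSD read bandwidth needed when a fraction `miss_rate` of per-token expert
    weights must actually be streamed from disk (the rest are already cached in RAM)."""
    return active_expert_gib_per_token * miss_rate * tokens_per_s

tok_s = 20.34  # throughput reported in the post
# Assumed: ~1 GiB of Q3-quantized expert weights touched per token (illustrative).
for miss in (1.0, 0.25, 0.05):
    print(f"miss rate {miss:4.0%}: {required_ssd_gibps(1.0, miss, tok_s):5.1f} GiB/s of SSD reads")
# Streaming everything would need ~20 GiB/s, far beyond any SSD; with good expert
# prediction and caching, the demand drops to something a few-GiB/s SSD can sustain.
```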