r/LocalLLaMA paid attention to this post because it read like real engineering cleanup rather than another inflated speed screenshot. On April 13, 2026, the author reported that a stock-MLX baseline for Qwen3.5-9B at 2048 tokens improved from 30.96 tok/s to 127.07 tok/s with 89.36% acceptance, and released the full runtime as open source.
#mlx
A fresh r/LocalLLaMA post published DFlash benchmarks on the M5 Max with MLX 0.31.1, reporting 127.07 tok/s and a 4.13x speedup on Qwen3.5-9B. The most useful part is not the headline number but the post’s clear reproduction setup and bandwidth-bound interpretation.
An r/LocalLLaMA implementation report says a native MLX DFlash runtime can speed up Qwen inference on Apple Silicon by more than 2x in several settings. The notable part is not only the throughput gain, but the claim that outputs remain bit-for-bit identical to the greedy baseline.
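The post does not include code, but the greedy-equivalence claim follows from the structure of draft-and-verify decoding. The sketch below is a minimal, framework-agnostic illustration; `draft_step`, `target_logits`, and the draft length `k` are placeholder callables standing in for the draft and target models, not DFlash's actual API.

```python
# Minimal sketch of the verify step that keeps greedy speculative decoding
# bit-for-bit identical to the plain greedy baseline. The model callables
# are hypothetical placeholders, not DFlash's real interface.

from typing import Callable, List

def _argmax(row: List[float]) -> int:
    return max(range(len(row)), key=row.__getitem__)

def speculative_greedy_step(
    prompt: List[int],
    draft_step: Callable[[List[int]], int],                   # draft model's greedy next token
    target_logits: Callable[[List[int]], List[List[float]]],  # per-position logits, one pass
    k: int = 8,                                                # tokens drafted per verify pass
) -> List[int]:
    # 1) Draft k tokens autoregressively with the cheap model.
    ctx = list(prompt)
    drafted: List[int] = []
    for _ in range(k):
        tok = draft_step(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Score the prompt plus draft with the target model in a single forward pass.
    logits = target_logits(prompt + drafted)

    # 3) Accept drafted tokens only while they equal the target's argmax; on the
    #    first mismatch, emit the target's own pick and stop. The emitted stream
    #    is therefore exactly what plain greedy decoding would have produced.
    accepted: List[int] = []
    for i, tok in enumerate(drafted):
        target_tok = _argmax(logits[len(prompt) + i - 1])
        if tok == target_tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)
            break
    return accepted
```

Because the loop only ever emits tokens the target model's argmax would have chosen, the output matches plain greedy decoding exactly; the acceptance rate (89.36% in the post) then determines how many tokens each verify pass yields, which is where the end-to-end speedup comes from.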
A March 31, 2026 Hacker News hit brought attention to Ollama’s new MLX-based Apple Silicon runtime. The announcement combines MLX, NVFP4, and upgraded cache behavior to make local coding-agent workloads on macOS more practical.
Ollama used a March 30, 2026 preview to move its Apple Silicon path onto MLX. The release pairs higher prefill and decode throughput with NVFP4 support and cache changes aimed at coding and agent workflows.
A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.
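TurboQuant's actual compression runs in custom Metal kernels and its exact scheme is not reproduced here; the round-trip below only illustrates the generic per-group int4 KV quantization that cache compressors of this kind build on. The group size, symmetric scaling, and unpacked int8 storage are assumptions for readability, not the repository's layout.

```python
import numpy as np

def quantize_kv_int4(kv: np.ndarray, group: int = 32):
    """Per-group symmetric int4 quantization of a KV tensor.
    kv: (..., head_dim) with head_dim divisible by `group` (assumed)."""
    g = kv.reshape(*kv.shape[:-1], -1, group)              # split last dim into groups
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0    # symmetric int4 range [-7, 7]
    scale = np.where(scale == 0, 1.0, scale)               # avoid divide-by-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)  # stored unpacked for clarity;
    return q, scale.astype(np.float16)                        # real kernels pack two nibbles/byte

def dequantize_kv_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    g = q.astype(np.float32) * scale.astype(np.float32)
    return g.reshape(*q.shape[:-2], -1)

# Round-trip check on a fake (batch, heads, seq, head_dim) cache slice.
kv = np.random.randn(1, 8, 128, 128).astype(np.float32)
q, s = quantize_kv_int4(kv)
err = np.abs(dequantize_kv_int4(q, s) - kv).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```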
A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains in prompt processing rather than raw generation. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.
A project post in r/MachineLearning points to mlx-tune, a library that wraps Apple’s MLX stack in an Unsloth-compatible training API for SFT, DPO, GRPO, LoRA, and vision-language fine-tuning on Apple Silicon Macs.
A detailed r/LocalLLaMA experiment claims that copying layer blocks at around 50-56% of model depth consistently hurts or collapses quality across multiple architectures. The post stands out because it compares dense, hybrid, MoE, and transplant setups from a fully local MLX workflow.
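To make the experiment concrete, here is a minimal sketch of the transplant being tested: duplicating the block of layers that sits at roughly 50-56% of depth. It assumes the model exposes an ordered list of transformer blocks, which is how most MLX- and PyTorch-style Qwen implementations are structured; the window boundaries mirror the post's setup, not a recommendation.

```python
# Sketch of a "copy a layer block" transplant; `layers` stands in for a
# model's ordered list of transformer blocks (an assumption, not a real API).

import copy

def duplicate_block(layers, start_frac=0.50, end_frac=0.56):
    n = len(layers)
    lo, hi = int(n * start_frac), int(n * end_frac)
    block = [copy.deepcopy(layer) for layer in layers[lo:hi]]  # copied weights, not shared
    # Re-insert the copy immediately after the original block, growing the depth.
    return layers[:hi] + block + layers[hi:]

# Example with placeholder "layers": a 36-layer model grows to 38 layers
# when the 50-56% window (layers 18-19) is duplicated.
layers = [f"layer_{i}" for i in range(36)]
print(len(duplicate_block(layers)))  # 38
```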
A fast-rising r/LocalLLaMA thread says the community has already submitted nearly 10,000 Apple Silicon benchmark runs across more than 400 models. The post matters because it replaces scattered anecdotes with a shared dataset that begins to show consistent throughput patterns across M-series chips and context lengths.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.
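The thread's argument reduces to a simple latency model: total time ≈ prompt_tokens / prefill_rate + output_tokens / decode_rate. The numbers below are illustrative assumptions, not measurements from the post, but they show how a long prompt shrinks the end-to-end benefit of a faster decoder.

```python
# Back-of-the-envelope latency model behind the thread's argument.
# All throughput figures are assumed for illustration only.

def total_latency(prompt_toks, out_toks, prefill_tps, decode_tps):
    return prompt_toks / prefill_tps + out_toks / decode_tps

short = dict(prompt_toks=512, out_toks=512)
long_ctx = dict(prompt_toks=32_000, out_toks=512)

for name, case in [("short context", short), ("long context", long_ctx)]:
    fast = total_latency(**case, prefill_tps=900, decode_tps=80)  # faster decode
    slow = total_latency(**case, prefill_tps=900, decode_tps=55)  # slower decode
    print(f"{name}: {fast:.1f}s vs {slow:.1f}s  (speedup {slow / fast:.2f}x)")
```

With a 32k-token prompt, prefill accounts for most of the wall-clock time in both cases, so the decode-speed advantage barely moves the total, which is exactly the trade-off the thread says tokens-per-second screenshots hide.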
A well-received r/LocalLLaMA experiment described tinyforge: Qwen 3.5 0.8B running on a MacBook Air, trained on 13 self-generated repair pairs from a test-feedback loop, with a reported holdout jump from 16/50 to 28/50.
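The post describes its loop only in prose; the sketch below shows one plausible shape of a test-feedback harvester that keeps only the (broken, fixed) pairs the test suite accepts. `propose_fix`, the snippet format, and the pytest invocation are all assumptions, not tinyforge's published code.

```python
# Hypothetical sketch of a test-feedback loop that harvests repair pairs,
# in the spirit of the tinyforge post. Not the author's actual implementation.

import json
import subprocess
from pathlib import Path

def run_tests(repo: Path) -> bool:
    # Assumed: the project exposes a pytest suite; any test runner works here.
    res = subprocess.run(["pytest", "-q"], cwd=repo, capture_output=True)
    return res.returncode == 0

def harvest_repair_pairs(broken_snippets, propose_fix, repo: Path, out="pairs.jsonl"):
    """broken_snippets: dicts with 'path' and 'code' keys (an assumed format)."""
    pairs = []
    for snippet in broken_snippets:
        candidate = propose_fix(snippet)            # model-generated fix text
        target = repo / snippet["path"]
        original = target.read_text()
        target.write_text(candidate)
        if run_tests(repo):                         # keep only fixes the suite accepts
            pairs.append({"broken": snippet["code"], "fixed": candidate})
        target.write_text(original)                 # restore the tree either way
    Path(out).write_text("\n".join(json.dumps(p) for p in pairs))
    return pairs
```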