r/LocalLLaMA Details an Autoresearch Push to 20.34 tok/s for Qwen3.5-397B on M5 Max
Original: Autoresearch on Qwen3.5-397B, 36 experiments to reach 20.34 tok/s on M5 Max, honest results
A new r/LocalLLaMA post published on March 30, 2026 is the kind of benchmark note the local inference community likes to see: fast numbers, but also a detailed account of what actually broke along the way. The author says an autoresearch loop running on a MacBook Pro with an M5 Max, 128GB unified memory, and a 40-core GPU pushed Qwen3.5-397B-A17B to 20.34 tokens per second in decode and 5.52 tokens per second in prefill. That is roughly double the author's own starting point on the same machine and 4.67x Dan Woods' earlier 4.36 tok/s baseline on an M3 Max.
The post builds on Flash-MoE and Anemll's fork, which use a pure C/Metal engine to stream a 209GB model from SSD on Apple Silicon. According to the write-up, the biggest wins came from system-level changes rather than one magical kernel. Enabling 16 I/O threads with cache-io-split=4 added about 1.5 tok/s by spreading reads across SSD channels. Temporal expert prediction exploited 27% cross-token routing correlation for another 4.3 tok/s. Q3-GGUF experts delivered a smaller payload with better-than-expected perplexity trade-offs, while CMD2 pre-encode and a fused Q/K/V projection kernel shaved smaller but meaningful chunks of overhead from the Metal path.
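The temporal-prediction idea is easy to sketch: if consecutive tokens tend to route to overlapping experts, the engine can speculatively keep (or prefetch from SSD) the previous token's expert weights before the router fires, and every prediction that lands is a saved disk read. The toy simulation below is a sketch under stated assumptions, not the post's actual code: the router, function names, and constants are hypothetical, and the stand-in router simply reuses each previous expert with ~27% probability to mimic the reported cross-token correlation.

```python
import random

def correlated_router(prev, k=8, n_experts=128, p_reuse=0.27):
    """Toy MoE router: reuse each of the previous token's experts
    with probability p_reuse, then fill the remaining top-k slots
    uniformly at random. Illustrative stand-in, not the real router."""
    chosen = set(e for e in prev if random.random() < p_reuse)
    while len(chosen) < k:
        chosen.add(random.randrange(n_experts))
    return list(chosen)

def measure_prefetch_hit_rate(steps=5000, k=8):
    """Simulate decode with temporal expert prefetch: before each
    step, speculatively keep the previous token's experts resident;
    after routing, count how many of the true top-k were already
    loaded (each hit is one SSD expert-read avoided)."""
    random.seed(0)
    prev = list(range(k))
    hits = total = 0
    for _ in range(steps):
        prefetched = set(prev)                 # speculative prefetch
        chosen = correlated_router(prev, k=k)  # actual routing
        hits += len(prefetched & set(chosen))
        total += k
        prev = chosen
    return hits / total
```

With these toy parameters the hit rate lands a little above the per-slot reuse probability, since random fills occasionally land on prefetched experts too; the real-world payoff depends entirely on how strong the correlation is for a given model and workload.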
Just as interesting is the failure log. The author says 78% of experiments were discarded. One-bit QJL quantization collapsed quality, ternary 2-bit sparsity failed, K=3 expert routing broke model behavior, and cross-layer prediction produced a 0% hit rate. Even the winning Q3 setup comes with caveats: long-form generation quality degraded, the evaluation used perplexity instead of broader benchmarks such as MMLU or GPQA, and the findings come from a single hardware platform. The post explicitly frames the work as speed research, not a production-quality claim.
There is also a useful architectural insight hiding inside the benchmark. Apple's Neural Engine reportedly sat idle at 0W throughout the run because dynamic MoE routing does not map cleanly to static precompiled ANE graphs. That leaves a large amount of theoretical compute stranded unless someone finds a clever way to use it during prefill or batching. The broader takeaway from r/LocalLLaMA is that very large local models are becoming a storage, scheduler, and kernel-optimization problem as much as a model problem, and transparent research logs like this are more valuable than a headline tokens-per-second number on their own.
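The ANE mismatch is visible in miniature: MoE gating is data-dependent top-k selection, so which expert weights must execute is unknown until the router logits for each token exist, while ANE execution expects a graph whose operator schedule is fixed at compile time. A minimal gate sketch (hypothetical helper written for illustration, not code from the engine):

```python
import math

def top_k_route(logits, k=2):
    """Data-dependent top-k gating. The chosen expert indices depend
    on this token's router logits, which is exactly what a static
    precompiled graph cannot bake in ahead of time."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:k]
    # softmax over the selected logits to get mixture weights
    m = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - m) for i in chosen]
    s = sum(exps)
    return chosen, [e / s for e in exps]
```

For example, `top_k_route([0.1, 2.0, -1.0, 1.5], k=2)` selects experts 1 and 3; a different token produces different logits and a different expert set, so the set of weight matrices to load and run changes every step.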
Related Articles
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
A Hacker News discussion highlighted Flash-MoE, a pure C/Metal inference stack that streams Qwen3.5-397B-A17B from SSD and reaches interactive speeds on a 48GB M3 Max laptop.
A Reddit post in r/LocalLLaMA introduced a GGUF release of Qwen3.5-122B-A10B Uncensored (Aggressive) alongside new K_P quants. The author claims 0/465 refusals and zero capability loss, but those results are presented as the author's own tests rather than independent verification.