r/LocalLLaMA Details an Autoresearch Push to 20.34 tok/s for Qwen3.5-397B on M5 Max
Original: Autoresearch on Qwen3.5-397B, 36 experiments to reach 20.34 tok/s on M5 Max, honest results
A new r/LocalLLaMA post published on March 30, 2026 is the kind of benchmark note the local inference community likes to see: fast numbers, but also a detailed account of what actually broke along the way. The author says an autoresearch loop running on a MacBook Pro with an M5 Max, 128GB unified memory, and a 40-core GPU pushed Qwen3.5-397B-A17B to 20.34 tokens per second in decode and 5.52 tokens per second in prefill. That is roughly double the author's own starting point on the same machine and 4.67x Dan Woods' earlier 4.36 tok/s baseline on an M3 Max.
The post builds on Flash-MoE and Anemll's fork, which use a pure C/Metal engine to stream a 209GB model from SSD on Apple Silicon. According to the write-up, the biggest wins came from system-level changes rather than one magical kernel. Enabling 16 I/O threads with cache-io-split=4 added about 1.5 tok/s by spreading reads across SSD channels. Temporal expert prediction exploited 27% cross-token routing correlation for another 4.3 tok/s. Q3-GGUF experts delivered a smaller payload with better-than-expected perplexity trade-offs, while CMD2 pre-encode and a fused Q/K/V projection kernel shaved smaller but meaningful chunks of overhead from the Metal path.
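The temporal-prediction idea is easy to sketch: if consecutive tokens tend to route to overlapping experts, the engine can speculatively keep (or prefetch from SSD) the previous token's expert weights before the router fires, and every prediction that lands is a saved disk read. The toy simulation below is a sketch under stated assumptions, not the post's actual code: the router, function names, and constants are hypothetical, and the stand-in router simply reuses each previous expert with ~27% probability to mimic the reported cross-token correlation.

```python
import random

def correlated_router(prev, k=8, n_experts=128, p_reuse=0.27):
    """Toy MoE router: reuse each of the previous token's experts
    with probability p_reuse, then fill the remaining top-k slots
    uniformly at random. Illustrative stand-in, not the real router."""
    chosen = set(e for e in prev if random.random() < p_reuse)
    while len(chosen) < k:
        chosen.add(random.randrange(n_experts))
    return list(chosen)

def measure_prefetch_hit_rate(steps=5000, k=8):
    """Simulate decode with temporal expert prefetch: before each
    step, speculatively keep the previous token's experts resident;
    after routing, count how many of the true top-k were already
    loaded (each hit is one SSD expert-read avoided)."""
    random.seed(0)
    prev = list(range(k))
    hits = total = 0
    for _ in range(steps):
        prefetched = set(prev)                 # speculative prefetch
        chosen = correlated_router(prev, k=k)  # actual routing
        hits += len(prefetched & set(chosen))
        total += k
        prev = chosen
    return hits / total
```

With these toy parameters the hit rate lands a little above the per-slot reuse probability, since random fills occasionally land on prefetched experts too; the real-world payoff depends entirely on how strong the correlation is for a given model and workload.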
Just as interesting is the failure log. The author says 78% of experiments were discarded. One-bit QJL quantization collapsed quality, ternary 2-bit sparsity failed, K=3 expert routing broke model behavior, and cross-layer prediction produced a 0% hit rate. Even the winning Q3 setup comes with caveats: long-form generation quality degraded, the evaluation used perplexity instead of broader benchmarks such as MMLU or GPQA, and the findings come from a single hardware platform. The post explicitly frames the work as speed research, not a production-quality claim.
There is also a useful architectural insight hiding inside the benchmark. Apple's Neural Engine reportedly sat idle at 0W throughout the run because dynamic MoE routing does not map cleanly to static precompiled ANE graphs. That leaves a large amount of theoretical compute stranded unless someone finds a clever way to use it during prefill or batching. The broader takeaway from r/LocalLLaMA is that very large local models are becoming a storage, scheduler, and kernel-optimization problem as much as a model problem, and transparent research logs like this are more valuable than a headline tokens-per-second number on their own.
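The ANE mismatch is visible in miniature: MoE gating is data-dependent top-k selection, so which expert weights must execute is unknown until the router logits for each token exist, while ANE execution expects a graph whose operator schedule is fixed at compile time. A minimal gate sketch (hypothetical helper written for illustration, not code from the engine):

```python
import math

def top_k_route(logits, k=2):
    """Data-dependent top-k gating. The chosen expert indices depend
    on this token's router logits, which is exactly what a static
    precompiled graph cannot bake in ahead of time."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:k]
    # softmax over the selected logits to get mixture weights
    m = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - m) for i in chosen]
    s = sum(exps)
    return chosen, [e / s for e in exps]
```

For example, `top_k_route([0.1, 2.0, -1.0, 1.5], k=2)` selects experts 1 and 3; a different token produces different logits and a different expert set, so the set of weight matrices to load and run changes every step.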
Related Articles
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
A Hacker News discussion highlighted Flash-MoE, a pure C/Metal inference stack that streams Qwen3.5-397B-A17B from SSD and reaches interactive speeds on a 48GB M3 Max laptop.
A Reddit post in r/LocalLLaMA introduced a GGUF release of Qwen3.5-122B-A10B Uncensored (Aggressive) alongside new K_P quants. The author claims 0/465 refusals and zero capability loss, but those results are presented as the author's own tests rather than independent verification.