Flash-MoE Shows 397B Qwen Inference on a 48GB MacBook Pro
Original: Flash-MoE: Running a 397B Parameter Model on a Laptop
A March 22, 2026 Hacker News thread pushed Flash-MoE into wider view because it attacks a familiar assumption: that extremely large MoE models are automatically out of reach for consumer hardware. The linked GitHub repository and paper describe a pure C/Metal inference engine for Qwen3.5-397B-A17B running on a MacBook Pro with an Apple M3 Max and 48GB of unified memory. The headline result is 4.36 tokens per second in the 4-bit production configuration, with 5.74 tok/s reported for a more aggressive 2-bit mode. The author is explicit that the faster 2-bit path is not reliable enough for JSON output or tool calling, so the slower 4-bit setup is the practical one.
What the project is actually doing
Flash-MoE avoids the usual “model must fit in RAM” requirement by streaming expert weights from NVMe SSD on demand. According to the project documentation, the 397B-parameter model occupies 209GB on disk in the 4-bit configuration, but only the experts chosen by the router for each token are fetched and processed. The write-up says the model has 60 transformer layers, with 45 GatedDeltaNet linear-attention layers and 15 full-attention layers, and that each layer exposes 512 experts while only K=4 are activated per token. That sparse structure is the reason the system can trade storage bandwidth for working memory.
- The paper says only about 5.5GB of weights need to be resident at any one time while the rest stream from disk.
- The Metal path includes hand-written dequantized matrix-vector kernels, fused normalization and activation stages, and GPU-side MoE combine operations.
- The abstract claims that removing application-level caching and relying on the macOS page cache improved performance by 38% because it reduced memory-compressor pressure.
Why Hacker News cared
The community interest is not just the stunt value of “397B on a laptop.” The more important point is that Flash-MoE reframes the bottleneck. On modern consumer systems, the limit is often not raw parameter count, but the interaction between SSD throughput, unified-memory bandwidth, quantization error, and sparse routing. The project also reports that on Apple Silicon, SSD DMA and GPU compute compete for the same memory controller, so trying to overlap them aggressively hurts latency. That makes the serial GPU-to-SSD-to-GPU pipeline a deliberate systems choice rather than an implementation shortcut.
Limits and open questions
This is still a hardware-specific proof point, not a general deployment recipe. The implementation is tightly optimized for Metal and for the Qwen3.5-397B-A17B architecture, and the faster 2-bit mode clearly loses reliability on structured outputs. Even so, the experiment is useful because it shows how far sparse MoE inference can move when storage-aware execution is treated as a first-class design problem. For local-LLM builders, that is the real takeaway from the thread.
Related Articles
A project post in r/MachineLearning points to mlx-tune, a library that wraps Apple’s MLX stack in an Unsloth-compatible training API for SFT, DPO, GRPO, LoRA, and vision-language fine-tuning on Apple Silicon Macs.
An r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, which adds a fused GDN recurrent Metal kernel. The PR shows roughly 12-36% throughput gains on Qwen 3.5 variants, though Reddit commenters noted the merged change can still trail MLX on some local benchmarks.
Google DeepMind updated Gemini 3.1 Flash-Lite on March 3, 2026 as a low-cost model for high-volume, low-latency work. Google says it supports 128k input, 8k output, multimodal input, native audio generation, and pricing from $0.10 per 1M input tokens.