Flash-MoE Shows 397B Qwen Inference on a 48GB MacBook Pro

Original: Flash-MoE: Running a 397B Parameter Model on a Laptop

LLM · Mar 23, 2026 · By Insights AI (HN) · 2 min read

A March 22, 2026 Hacker News thread pushed Flash-MoE into wider view because it attacks a familiar assumption: that extremely large MoE models are automatically out of reach for consumer hardware. The linked GitHub repository and paper describe a pure C/Metal inference engine for Qwen3.5-397B-A17B running on a MacBook Pro with an Apple M3 Max and 48GB of unified memory. The headline result is 4.36 tokens per second (tok/s) in the 4-bit production configuration, with 5.74 tok/s reported for a more aggressive 2-bit mode. The author is explicit that the faster 2-bit path is not reliable enough for JSON output or tool calling, so the slower 4-bit setup is the practical one.
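A quick back-of-envelope calculation (ours, not the project's; it uses decimal GB and assumes uniform 4-bit storage, while the real file also holds quantization scales and some higher-precision tensors) lines up with the reported figures:

```python
total_params = 397e9    # Qwen3.5-397B-A17B: total parameters
active_params = 17e9    # the "A17B": parameters activated per token
tok_per_s = 4.36        # reported 4-bit throughput

disk_gb = total_params * 4 / 8 / 1e9      # 4 bits = 0.5 bytes per parameter
active_gb = active_params * 4 / 8 / 1e9   # weight bytes touched per token
naive_bw = active_gb * tok_per_s          # GB/s if every byte came off SSD

print(f"pure 4-bit weights: {disk_gb:.1f} GB (209 GB reported on disk)")
print(f"touched per token:  {active_gb:.1f} GB")
print(f"naive streaming bandwidth: {naive_bw:.1f} GB/s")
```

The implied ~37 GB/s is far beyond what a laptop NVMe drive can sustain, which is why reuse of recently routed experts from memory, rather than re-reading every weight per token, is essential to the reported throughput.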

What the project is actually doing

Flash-MoE avoids the usual “model must fit in RAM” requirement by streaming expert weights from NVMe SSD on demand. According to the project documentation, the 397B-parameter model occupies 209GB on disk in the 4-bit configuration, but only the experts chosen by the router for each token are fetched and processed. The write-up says the model has 60 transformer layers, with 45 GatedDeltaNet linear-attention layers and 15 full-attention layers, and that each layer exposes 512 experts while only K=4 are activated per token. That sparse structure is the reason the system can trade storage bandwidth for working memory.
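The routing structure described above can be sketched in a few lines. Everything here is our own illustrative stand-in, not the project's code: a toy "expert store" simulates on-demand fetches from disk, and a top-K router touches only the K=4 experts of 512 that the gate selects.

```python
import numpy as np

def route_top_k(router_logits: np.ndarray, k: int = 4):
    """Pick the top-K experts and softmax their gate weights."""
    top = np.argpartition(router_logits, -k)[-k:]
    w = np.exp(router_logits[top] - router_logits[top].max())
    return top, w / w.sum()

class ExpertStore:
    """Toy stand-in for NVMe-resident expert weights (names are ours)."""
    def __init__(self, n_experts: int, d: int, rng):
        self._disk = [rng.standard_normal((d, d)).astype(np.float32)
                      for _ in range(n_experts)]
        self.reads = 0  # count simulated disk fetches

    def fetch(self, idx: int) -> np.ndarray:
        self.reads += 1
        return self._disk[idx]

rng = np.random.default_rng(0)
d, n_experts = 8, 512
store = ExpertStore(n_experts, d, rng)

x = rng.standard_normal(d).astype(np.float32)
logits = rng.standard_normal(n_experts)
experts, w = route_top_k(logits, k=4)

# MoE combine: weighted sum of the routed experts' outputs.
y = sum(wi * (store.fetch(e) @ x) for e, wi in zip(experts, w))

print(f"experts fetched: {store.reads} of {n_experts}")  # only 4 touched
```

The point of the sketch is the ratio: per token, each MoE layer reads 4/512 of its expert weights, which is what lets working memory stay small while total parameters stay huge.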

  • The paper says only about 5.5GB of weights need to be resident at any one time while the rest stream from disk.
  • The Metal path includes hand-written dequantized matrix-vector kernels, fused normalization and activation stages, and GPU-side MoE combine operations.
  • The abstract claims that removing application-level caching and relying on the macOS page cache improved performance by 38% because it reduced memory-compressor pressure.
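The page-cache point in the last bullet can be made concrete with a small sketch (ours, under stated assumptions: a flat file of fixed-size toy experts, read through `mmap` so the kernel, not the application, decides what stays resident):

```python
import mmap
import os
import struct
import tempfile

EXPERT_BYTES = 64 * 4  # toy expert: 64 float32 values

# Write 512 toy "experts" contiguously into a scratch file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for i in range(512):
        f.write(struct.pack("<64f", *([float(i)] * 64)))

def read_expert(mm: mmap.mmap, idx: int) -> tuple:
    # Slicing the mmap touches only the pages backing this expert; the
    # kernel keeps hot experts cached and evicts cold ones under pressure,
    # so no application-level cache layer is needed.
    off = idx * EXPERT_BYTES
    return struct.unpack("<64f", mm[off:off + EXPERT_BYTES])

with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    vals = read_expert(mm, 42)
    print(vals[0])  # 42.0

os.unlink(path)
```

Dropping a user-space cache in favor of this kind of OS-managed residency is plausible as a win on macOS, where a redundant application copy of the same bytes would also feed the memory compressor; the 38% figure itself is the paper's claim, not something this sketch demonstrates.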

Why Hacker News cared

The community interest is not just the stunt value of “397B on a laptop.” The more important point is that Flash-MoE reframes the bottleneck. On modern consumer systems, the limit is often not raw parameter count, but the interaction between SSD throughput, unified-memory bandwidth, quantization error, and sparse routing. The project also reports that on Apple Silicon, SSD DMA and GPU compute compete for the same memory controller, so trying to overlap them aggressively hurts latency. That makes the serial GPU-to-SSD-to-GPU pipeline a deliberate systems choice rather than an implementation shortcut.

Limits and open questions

This is still a hardware-specific proof point, not a general deployment recipe. The implementation is tightly optimized for Metal and for the Qwen3.5-397B-A17B architecture, and the faster 2-bit mode clearly loses reliability on structured outputs. Even so, the experiment is useful because it shows how far sparse MoE inference can move when storage-aware execution is treated as a first-class design problem. For local-LLM builders, that is the real takeaway from the thread.




© 2026 Insights. All rights reserved.