Flash-MoE: Running a 397B Parameter Model on a Laptop

LLM | Mar 22, 2026 | By Insights AI (HN) | 2 min read

The Hacker News thread was at 194 points and 68 comments when this crawl ran. The linked project, Flash-MoE, describes a pure C and Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro with 48 GB of unified memory at 4.4+ tokens per second. The author frames it as more than a stunt, explicitly targeting production-quality output with tool calling rather than a fragile benchmark-only demo.

The most important design choice is memory handling. The repository says the full model occupies 209 GB on disk, so the system does not attempt to keep all experts resident in RAM. Instead, it streams expert weights from SSD on demand and relies on sparse activation so that only the K=4 active experts for each layer need to be loaded per token. The current best path is the 4-bit expert configuration with an FMA-optimized kernel at 4.36 tok/s. A 2-bit route can push higher throughput, but the README says it breaks JSON output and makes tool calling unreliable, so 4-bit is treated as the real production configuration.

The architecture details are unusually specific. Flash-MoE describes a 60-layer transformer with 45 GatedDeltaNet linear-attention layers and 15 standard full-attention layers. Each layer contains 512 experts, with K=4 experts activated per token plus one shared expert. That is exactly the kind of sparsity pattern that makes a very large MoE model plausible on a small machine: the total parameter count is huge, but the active working set per step is much smaller than a dense 397B model would require.

The implementation is also notable for how low-level it is. The project lists hand-tuned Metal compute shaders, an FMA-oriented dequantization kernel, Accelerate BLAS for the linear-attention recurrence, and a deliberate "Trust the OS" strategy that leans on the macOS page cache instead of inventing a separate expert cache. According to the README, that page cache naturally reaches a roughly 71% hit rate and outperformed several custom caching schemes that were tested and discarded.

  • Hardware target: MacBook Pro M3 Max, 48 GB unified memory, 1 TB SSD
  • Best reported mode: 4-bit experts with FMA kernel at 4.36 tok/s
  • Main tradeoff: 2-bit is faster but not reliable enough for JSON and tool use

Flash-MoE matters because it pushes against the assumption that very large frontier-scale models are automatically synonymous with very large servers. It is still a niche engineering project tied to specific Apple hardware and a carefully optimized software stack, not a turnkey consumer app. But as a demonstration of what sparse MoE, SSD streaming, and aggressively tuned local kernels can do together, it materially advances the ceiling for laptop-class local inference experiments.




© 2026 Insights. All rights reserved.