Flash-MoE Shows 397B Qwen Inference on a 48GB MacBook Pro
Original: Flash-MoE: Running a 397B Parameter Model on a Laptop
A March 22, 2026 Hacker News thread pushed Flash-MoE into wider view because it attacks a familiar assumption: that extremely large MoE models are automatically out of reach for consumer hardware. The linked GitHub repository and paper describe a pure C/Metal inference engine for Qwen3.5-397B-A17B running on a MacBook Pro with an Apple M3 Max and 48GB of unified memory. The headline result is 4.36 tokens per second in the 4-bit production configuration, with 5.74 tok/s reported for a more aggressive 2-bit mode. The author is explicit that the faster 2-bit path is not reliable enough for JSON output or tool calling, so the slower 4-bit setup is the practical one.
What the project is actually doing
Flash-MoE avoids the usual “model must fit in RAM” requirement by streaming expert weights from NVMe SSD on demand. According to the project documentation, the 397B-parameter model occupies 209GB on disk in the 4-bit configuration, but only the experts chosen by the router for each token are fetched and processed. The write-up says the model has 60 transformer layers, with 45 GatedDeltaNet linear-attention layers and 15 full-attention layers, and that each layer exposes 512 experts while only K=4 are activated per token. That sparse structure is the reason the system can trade storage bandwidth for working memory.
- The paper says only about 5.5GB of weights need to be resident at any one time while the rest stream from disk.
- The Metal path includes hand-written dequantized matrix-vector kernels, fused normalization and activation stages, and GPU-side MoE combine operations.
- The abstract claims that removing application-level caching and relying on the macOS page cache improved performance by 38% because it reduced memory-compressor pressure.
Why Hacker News cared
The community interest is not just the stunt value of “397B on a laptop.” The more important point is that Flash-MoE reframes the bottleneck. On modern consumer systems, the limit is often not raw parameter count, but the interaction between SSD throughput, unified-memory bandwidth, quantization error, and sparse routing. The project also reports that on Apple Silicon, SSD DMA and GPU compute compete for the same memory controller, so trying to overlap them aggressively hurts latency. That makes the serial GPU-to-SSD-to-GPU pipeline a deliberate systems choice rather than an implementation shortcut.
Limits and open questions
This is still a hardware-specific proof point, not a general deployment recipe. The implementation is tightly optimized for Metal and for the Qwen3.5-397B-A17B architecture, and the faster 2-bit mode clearly loses reliability on structured outputs. Even so, the experiment is useful because it shows how far sparse MoE inference can move when storage-aware execution is treated as a first-class design problem. For local-LLM builders, that is the real takeaway from the thread.
Related Articles
A project post in r/MachineLearning points to mlx-tune, a library that wraps Apple’s MLX stack in an Unsloth-compatible training API for SFT, DPO, GRPO, LoRA, and vision-language fine-tuning on Apple Silicon Macs.
An r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, which adds a fused GDN recurrent Metal kernel. The PR shows roughly 12-36% throughput gains on Qwen 3.5 variants, though Reddit commenters noted the merged change can still trail MLX on some local benchmarks.
Google DeepMind updated Gemini 3.1 Flash-Lite on March 3, 2026 as a low-cost model for high-volume, low-latency work. Google says it supports 128k input, 8k output, multimodal input, native audio generation, and pricing from $0.10 per 1M input tokens.