A new r/LocalLLaMA benchmark post reports an M5 Max system pushing Qwen3.5-397B to 20.34 tok/s via SSD streaming, crediting most of the gain to I/O parallelism, temporal expert prediction, and Q3-GGUF experts.
#metal
A March 28, 2026 r/LocalLLaMA post turned TurboQuant from a paper topic into an MLX implementation story with custom Metal kernels, code, and an upstream PR. The author reports 4.6x KV cache compression at 0.98x FP16 speed on Qwen2.5-32B, but the repository's 7B README numbers are more conservative, underscoring how model choice and integration details shape the real payoff.
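For context on what KV cache quantization buys, here is a minimal sketch of per-group low-bit quantization of cached keys/values. This is not TurboQuant's actual scheme; the group size, the 4-bit format, and the `quantize_groups` helper are illustrative assumptions.

```python
# Minimal sketch of quantized KV caching in the spirit of the post --
# NOT TurboQuant's actual algorithm. Per-group asymmetric 4-bit
# quantization of keys/values at cache-write time, dequantized on read.
import numpy as np

GROUP = 32  # elements per quantization group (assumed)

def quantize_groups(x: np.ndarray):
    """Quantize a 1-D float array to 4-bit codes with per-group scale/zero."""
    x = x.reshape(-1, GROUP)
    lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8           # 4 bits -> 16 levels
    codes = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_groups(codes, scale, lo):
    return (codes.astype(np.float32) * scale + lo).reshape(-1)

head_dim = 128
k = np.random.randn(head_dim).astype(np.float32)   # one cached key vector
codes, scale, lo = quantize_groups(k)
k_hat = dequantize_groups(codes, scale, lo)
# ~4x smaller than FP16 before scale/zero overhead; error stays small:
print(float(np.abs(k - k_hat).max()))
```

The exact compression ratio depends on group size and metadata precision, which is part of why reported numbers vary between the post's 32B results and the repository's 7B README.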
A Hacker News discussion highlighted Flash-MoE, a pure C/Metal inference stack that streams Qwen3.5-397B-A17B from SSD and reaches interactive speeds on a 48 GB M3 Max laptop.
Flash-MoE is a C and Metal inference engine that claims to run Qwen3.5-397B-A17B on a 48 GB MacBook Pro. The key idea is to keep a 209 GB MoE model on SSD and stream only the active experts needed for each token.
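A minimal sketch of that streaming idea under stated assumptions: expert weights stored as fixed-size contiguous blobs in one file, read on demand through mmap, with a small LRU cache for hot experts. The `EXPERT_BYTES` constant, file layout, and `ExpertStreamer` class are hypothetical, not Flash-MoE's actual on-disk format.

```python
# Hypothetical per-token expert streaming: only the router-selected
# experts are read from SSD; recently used experts stay cached in RAM.
import mmap, collections
import numpy as np

EXPERT_BYTES = 8 * 1024 * 1024  # assumed size of one quantized expert blob

class ExpertStreamer:
    def __init__(self, path: str, cache_size: int = 64):
        self.f = open(path, "rb")
        self.mm = mmap.mmap(self.f.fileno(), 0, access=mmap.ACCESS_READ)
        self.cache = collections.OrderedDict()  # expert_id -> weights (LRU)
        self.cache_size = cache_size

    def fetch(self, expert_id: int) -> np.ndarray:
        if expert_id in self.cache:               # hot expert: no SSD read
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        off = expert_id * EXPERT_BYTES
        blob = self.mm[off : off + EXPERT_BYTES]  # page-cache-backed read
        w = np.frombuffer(blob, dtype=np.uint8)   # dequantization happens later
        self.cache[expert_id] = w
        if len(self.cache) > self.cache_size:     # evict least-recently-used
            self.cache.popitem(last=False)
        return w

# Per token: the router picks top-k experts, and only those blobs leave the SSD.
# streamer = ExpertStreamer("experts.bin")
# weights = [streamer.fetch(e) for e in top_k_expert_ids]
```

The temporal expert prediction from the benchmark post above would sit on top of this: prefetch the experts chosen for recent tokens before the next token needs them, overlapping SSD reads with compute.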
An r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, which adds a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, though Reddit commenters noted that even with the change merged, llama.cpp can still trail MLX on some local benchmarks.
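For readers unfamiliar with what such a kernel fuses, here is a toy float32 reference of one common gated delta-rule recurrence; a fused kernel performs each step's decay, erase, write, and read in a single launch instead of separate ops. Shapes, gating placement, and key normalization are illustrative assumptions, not the PR's exact formulation.

```python
# Toy single-head reference for a gated delta-rule recurrence of the kind
# a fused GDN Metal kernel computes in one pass per token. Illustrative
# assumptions throughout; not the llama.cpp formulation from PR #20361.
import numpy as np

def gdn_step(S, q, k, v, alpha, beta):
    """One recurrent step: gated decay, delta-rule erase/write, then read.

    S: (d_v, d_k) fast-weight state; q, k: (d_k,); v: (d_v,);
    alpha, beta: scalar gates in (0, 1).
    """
    S = alpha * (S - beta * np.outer(S @ k, k))  # decay + erase old value at key k
    S = S + beta * np.outer(v, k)                # write new key->value association
    return S, S @ q                              # read: output for this token

d_k, d_v, T = 64, 64, 8
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k), dtype=np.float32)
for _ in range(T):
    q = rng.standard_normal(d_k, dtype=np.float32)
    k = rng.standard_normal(d_k, dtype=np.float32)
    k /= np.linalg.norm(k)                       # unit keys keep the erase stable
    v = rng.standard_normal(d_v, dtype=np.float32)
    S, o = gdn_step(S, q, k, v, alpha=0.95, beta=0.5)
```

Fusing matters here because the recurrence is strictly sequential across tokens, so per-step kernel-launch and memory-traffic overhead dominates unless the whole update runs in one kernel.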