Hacker News spots Hypura running oversized LLMs on Macs with tier-aware scheduling
Original: Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
Hacker News noticed Hypura because it tackles a familiar local-inference failure mode on Apple Silicon: models that technically fit on disk but exceed practical memory once inference starts. The project describes itself as a storage-tier-aware scheduler that places model tensors across GPU memory, system RAM, and NVMe instead of assuming every weight has to live in one tier at all times.
According to the repository README, Hypura profiles the machine, reads the GGUF layout, and then decides which tensors should stay on GPU, spill into RAM, or stream from NVMe on demand. Norms and embeddings are pinned close to compute because they are touched every token. For MoE models, the scheduler intercepts router decisions and loads only the experts that actually fire, while a neuron cache tries to exploit temporal locality across tokens. For dense models, large FFN weights are streamed through a buffer with predictive prefetch.
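The placement policy described above can be sketched as a greedy heat-ordered fill. This is a hypothetical illustration, not Hypura's actual code: the `Tensor` type, the budgets, and the tier names are assumptions made for the example.

```python
# Hypothetical sketch of tier-aware tensor placement (not Hypura's code).
# Per-token "hot" tensors (norms, embeddings) are pinned to GPU first;
# everything else spills to RAM, then NVMe, as each budget fills up.
from dataclasses import dataclass

GPU_BUDGET = 18 * 1024**3   # bytes of GPU-visible memory to use
RAM_BUDGET = 10 * 1024**3   # bytes of system-RAM spill space

@dataclass
class Tensor:
    name: str
    size: int   # bytes
    hot: bool   # touched every token (norms, embeddings)

def place(tensors):
    """Assign each tensor to 'gpu', 'ram', or 'nvme'."""
    placement = {}
    gpu_left, ram_left = GPU_BUDGET, RAM_BUDGET
    # Hot tensors sort first, then smaller tensors, so per-token
    # weights are guaranteed residency before bulk FFN blocks.
    for t in sorted(tensors, key=lambda t: (not t.hot, t.size)):
        if t.size <= gpu_left:
            placement[t.name] = "gpu"
            gpu_left -= t.size
        elif t.size <= ram_left:
            placement[t.name] = "ram"
            ram_left -= t.size
        else:
            placement[t.name] = "nvme"   # streamed on demand
    return placement

tensors = [
    Tensor("tok_embd", 1 * 1024**3, hot=True),
    Tensor("blk.0.ffn", 6 * 1024**3, hot=False),
    Tensor("blk.1.ffn", 6 * 1024**3, hot=False),
    Tensor("blk.2.ffn", 6 * 1024**3, hot=False),
    Tensor("blk.3.ffn", 6 * 1024**3, hot=False),
]
print(place(tensors))
```

With these toy numbers the embedding table and two FFN blocks land on GPU, one block spills to RAM, and the last streams from NVMe, which is the shape of outcome the README describes.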
- The README says a 31 GB Mixtral 8x7B model runs on a 32 GB Mac Mini at 2.2 tok/s.
- It also reports a 40 GB Llama 70B configuration at 0.3 tok/s on the same memory class, where vanilla llama.cpp would OOM.
- For expert streaming, the project claims 75% less I/O and a 99.5% neuron-cache hit rate after warmup.
The point is not that NVMe becomes magically as fast as VRAM. Hypura's argument is that model architecture matters enough to make tiered scheduling worthwhile. MoE sparsity means only a small subset of weights is hot at each token, and even dense models have components that benefit disproportionately from staying resident. By treating storage as a cold tier rather than a fatal fallback, the project tries to turn “doesn't load” into “runs, but slower.”
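The claimed 75% I/O reduction is at least consistent with Mixtral's routing: with top-2-of-8 expert selection, only a quarter of the expert weights are touched per token. A quick back-of-the-envelope check, assuming expert FFN weights dominate I/O:

```python
# Sanity-check the 75% I/O claim against Mixtral 8x7B's top-2-of-8
# routing, assuming expert FFN weights dominate per-token I/O.
experts_total = 8
experts_active = 2   # top-k experts fired per token

io_fraction = experts_active / experts_total
reduction = 1 - io_fraction
print(f"expert weights touched per token: {io_fraction:.0%}")  # 25%
print(f"naive upper bound on I/O savings: {reduction:.0%}")    # 75%
```

This is an upper bound on savings from sparsity alone; the neuron cache's 99.5% hit rate after warmup is what would keep the remaining 25% from hitting NVMe on every token.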
That makes the HN interest understandable. Local LLM tooling on Macs is increasingly constrained by memory ceilings rather than raw GPU availability, and Hypura is one of the clearer attempts to turn those ceilings into a scheduling problem. The repo also exposes an Ollama-compatible server, suggesting the authors are thinking about interoperability rather than only benchmark screenshots.
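If the server really is Ollama-compatible, any standard Ollama client should work against it. A minimal sketch, assuming Hypura serves the usual `/api/generate` route on Ollama's default port; the host, port, and model tag here are assumptions, not facts from the repo:

```python
# Sketch of a client for an Ollama-compatible endpoint. Host, port,
# and model tag are assumptions; check the Hypura repo for the real ones.
import json
import urllib.request

def build_request(prompt, model="mixtral:8x7b",
                  host="http://localhost:11434"):
    """Build a POST request in the standard Ollama /api/generate shape."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    return urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"})

def generate(prompt, **kw):
    """Send the request and return the model's text (server must be up)."""
    with urllib.request.urlopen(build_request(prompt, **kw)) as resp:
        return json.loads(resp.read())["response"]

req = build_request("Why is the sky blue?")
print(req.full_url)   # http://localhost:11434/api/generate
# print(generate("Why is the sky blue?"))  # uncomment with the server running
```

Speaking the Ollama wire format means existing front ends and editor integrations can point at Hypura without modification, which supports the interoperability reading above.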
Primary source: Hypura repository. Community source: Hacker News thread.
Related Articles
A project post in r/MachineLearning points to mlx-tune, a library that wraps Apple’s MLX stack in an Unsloth-compatible training API for SFT, DPO, GRPO, LoRA, and vision-language fine-tuning on Apple Silicon Macs.
A Hacker News discussion highlighted Flash-MoE, a pure C/Metal inference stack that streams Qwen3.5-397B-A17B from SSD and reaches interactive speeds on a 48GB M3 Max laptop.
A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains on prompt processing rather than raw generation alone. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.