Hacker News spots Hypura running oversized LLMs on Macs with tier-aware scheduling
Original: Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
Hacker News noticed Hypura because it tackles a familiar local-inference failure mode on Apple Silicon: models that technically fit on disk but exceed practical memory once inference starts. The project describes itself as a storage-tier-aware scheduler that places model tensors across GPU memory, system RAM, and NVMe instead of assuming every weight has to live in one tier at all times.
According to the repository README, Hypura profiles the machine, reads the GGUF layout, and then decides which tensors should stay on GPU, spill into RAM, or stream from NVMe on demand. Norms and embeddings are pinned close to compute because they are touched every token. For MoE models, the scheduler intercepts router decisions and loads only the experts that actually fire, while a neuron cache tries to exploit temporal locality across tokens. For dense models, large FFN weights are streamed through a buffer with predictive prefetch.
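The placement policy described above can be sketched as a greedy heat-ordered fill. This is a hypothetical illustration, not Hypura's actual code: the `Tensor` type, the budgets, and the tier names are assumptions made for the example.

```python
# Hypothetical sketch of tier-aware tensor placement (not Hypura's code).
# Per-token "hot" tensors (norms, embeddings) are pinned to GPU first;
# everything else spills to RAM, then NVMe, as each budget fills up.
from dataclasses import dataclass

GPU_BUDGET = 18 * 1024**3   # bytes of GPU-visible memory to use
RAM_BUDGET = 10 * 1024**3   # bytes of system-RAM spill space

@dataclass
class Tensor:
    name: str
    size: int   # bytes
    hot: bool   # touched every token (norms, embeddings)

def place(tensors):
    """Assign each tensor to 'gpu', 'ram', or 'nvme'."""
    placement = {}
    gpu_left, ram_left = GPU_BUDGET, RAM_BUDGET
    # Hot tensors sort first, then smaller tensors, so per-token
    # weights are guaranteed residency before bulk FFN blocks.
    for t in sorted(tensors, key=lambda t: (not t.hot, t.size)):
        if t.size <= gpu_left:
            placement[t.name] = "gpu"
            gpu_left -= t.size
        elif t.size <= ram_left:
            placement[t.name] = "ram"
            ram_left -= t.size
        else:
            placement[t.name] = "nvme"   # streamed on demand
    return placement

tensors = [
    Tensor("tok_embd", 1 * 1024**3, hot=True),
    Tensor("blk.0.ffn", 6 * 1024**3, hot=False),
    Tensor("blk.1.ffn", 6 * 1024**3, hot=False),
    Tensor("blk.2.ffn", 6 * 1024**3, hot=False),
    Tensor("blk.3.ffn", 6 * 1024**3, hot=False),
]
print(place(tensors))
```

With these toy numbers the embedding table and two FFN blocks land on GPU, one block spills to RAM, and the last streams from NVMe, which is the shape of outcome the README describes.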
- The README says a 31 GB Mixtral 8x7B model runs on a 32 GB Mac Mini at 2.2 tok/s.
- It also reports a 40 GB Llama 70B configuration at 0.3 tok/s on the same memory class, where vanilla llama.cpp would OOM.
- For expert streaming, the project claims 75% less I/O and a 99.5% neuron-cache hit rate after warmup.
The point is not that NVMe becomes magically as fast as VRAM. Hypura's argument is that model architecture matters enough to make tiered scheduling worthwhile. MoE sparsity means only a small subset of weights is hot at each token, and even dense models have components that benefit disproportionately from staying resident. By treating storage as a cold tier rather than a fatal fallback, the project tries to turn “doesn't load” into “runs, but slower.”
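The claimed 75% I/O reduction is at least consistent with Mixtral's routing: with top-2-of-8 expert selection, only a quarter of the expert weights are touched per token. A quick back-of-the-envelope check, assuming expert FFN weights dominate I/O:

```python
# Sanity-check the 75% I/O claim against Mixtral 8x7B's top-2-of-8
# routing, assuming expert FFN weights dominate per-token I/O.
experts_total = 8
experts_active = 2   # top-k experts fired per token

io_fraction = experts_active / experts_total
reduction = 1 - io_fraction
print(f"expert weights touched per token: {io_fraction:.0%}")  # 25%
print(f"naive upper bound on I/O savings: {reduction:.0%}")    # 75%
```

This is an upper bound on savings from sparsity alone; the neuron cache's 99.5% hit rate after warmup is what would keep the remaining 25% from hitting NVMe on every token.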
That makes the HN interest understandable. Local LLM tooling on Macs is increasingly constrained by memory ceilings rather than raw GPU availability, and Hypura is one of the clearer attempts to turn those ceilings into a scheduling problem. The repo also exposes an Ollama-compatible server, suggesting the authors are thinking about interoperability rather than only benchmark screenshots.
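If the server really is Ollama-compatible, any standard Ollama client should work against it. A minimal sketch, assuming Hypura serves the usual `/api/generate` route on Ollama's default port; the host, port, and model tag here are assumptions, not facts from the repo:

```python
# Sketch of a client for an Ollama-compatible endpoint. Host, port,
# and model tag are assumptions; check the Hypura repo for the real ones.
import json
import urllib.request

def build_request(prompt, model="mixtral:8x7b",
                  host="http://localhost:11434"):
    """Build a POST request in the standard Ollama /api/generate shape."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    return urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"})

def generate(prompt, **kw):
    """Send the request and return the model's text (server must be up)."""
    with urllib.request.urlopen(build_request(prompt, **kw)) as resp:
        return json.loads(resp.read())["response"]

req = build_request("Why is the sky blue?")
print(req.full_url)   # http://localhost:11434/api/generate
# print(generate("Why is the sky blue?"))  # uncomment with the server running
```

Speaking the Ollama wire format means existing front ends and editor integrations can point at Hypura without modification, which supports the interoperability reading above.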
Primary source: Hypura repository. Community source: Hacker News thread.
Related Articles
A project post in r/MachineLearning points to mlx-tune, a library that wraps Apple’s MLX stack in an Unsloth-compatible training API for SFT, DPO, GRPO, LoRA, and vision-language fine-tuning on Apple Silicon Macs.
A Hacker News discussion highlighted Flash-MoE, a pure C/Metal inference stack that streams Qwen3.5-397B-A17B from SSD and reaches interactive speeds on a 48GB M3 Max laptop.
A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains on prompt processing rather than raw generation alone. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.