Local LLM users want the missing 80-160B middle
Original: We need a 80-160B model urgently. The unified memory device market needs more Models.
A new LocalLLaMA discussion put a practical gap in the local model market into plain terms: recent releases cluster around fast 27B-35B models or huge frontier-style MoE systems, while users with 80-128GB-class memory setups have fewer fresh choices. The post names Apple devices with more than 96GB memory, Ryzen AI 395 systems, DGX Spark, RTX 6000 Pro, multi-3090 rigs, and large DDR4/DDR5 machines as examples of hardware that has capacity but not always the bandwidth for the largest current models.
The complaint is not that small models are bad. Qwen and Gemma-class releases have made local inference much more useful for coding, private documents, and automation. The problem is that many buyers now sit between categories. They can fit more than a 35B model, but the latest massive models such as GLM 5.2, DeepSeek V4 Pro, Kimi, or MiniMax are too large or too slow for comfortable local use. That leaves older 80B-120B models, or a step down to smaller current models.
The thread’s concrete ask is a sparse model around 100B total parameters with roughly 10B active parameters, tuned for systems with 64GB VRAM or 80-128GB unified memory. That target says a lot about where local AI demand is moving. Users are no longer only asking whether a model can fit. They are asking whether the quality jump is worth the tokens per second, whether long context fits without painful memory pressure, and whether consumer or prosumer machines can run something close enough to current closed-model utility.
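The "can it fit" half of that question comes down to simple arithmetic, sketched below. This is a rough estimate and not a measurement: the function names, the 4-bit and 8-bit weight widths, and the flat 8 GB allowance for KV cache and runtime buffers are all illustrative assumptions, not figures from the thread. Real quantization formats usually use slightly more bits per weight than their nominal width.

```python
# Rough sizing sketch. All names and the overhead figure are hypothetical.

def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    total_params_b is the total parameter count in billions. For a sparse
    (MoE) model every expert must be resident, so total parameters, not
    active parameters, determine the footprint.
    """
    return total_params_b * bits_per_weight / 8


def fits(total_params_b: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 8.0) -> bool:
    """True if weights plus an assumed flat overhead fit in memory_gb.

    overhead_gb stands in for KV cache, activations, and runtime buffers;
    8 GB is an assumption and grows with context length in practice.
    """
    return weight_memory_gb(total_params_b, bits_per_weight) + overhead_gb <= memory_gb


if __name__ == "__main__":
    for bits in (4, 8):
        for memory in (64, 96, 128):
            size = weight_memory_gb(100, bits)
            print(f"100B @ {bits}-bit = {size:.0f} GB, fits in {memory} GB: "
                  f"{fits(100, bits, memory)}")
```

Under these assumptions a 100B model at 4 bits needs about 50 GB for weights, which is why the 64GB VRAM and 80-128GB unified memory targets in the thread line up with that parameter count.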
Community replies dug into attention mechanisms and memory bandwidth. Hybrid or linear attention could make very long context cheaper, but several users pointed out that unified memory capacity does not erase throughput limits. This is the kind of hardware-shaped demand that model labs can miss if they optimize only for hosted APIs or headline benchmark scores. A credible 80-160B tier could become the practical bridge between small daily-driver models and the largest open-weight systems.
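The point that capacity does not erase throughput limits can be illustrated with a common back-of-the-envelope bound: during decoding, each generated token requires reading roughly the active weights from memory once, so memory bandwidth caps tokens per second. The sketch below assumes that simplification and ignores KV-cache reads, compute limits, and batching. The function name and the 250 GB/s bandwidth figure are hypothetical placeholders, not specifications of any device named in the thread.

```python
# Bandwidth-bound decode estimate. A simplification under stated assumptions.

def decode_tps_upper_bound(active_params_b: float, bits_per_weight: float,
                           bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/second from memory bandwidth alone.

    active_params_b: parameters touched per token, in billions.
    bandwidth_gb_s: sustained memory bandwidth in GB/s (assumed value).
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    bandwidth = 250.0  # hypothetical unified-memory bandwidth in GB/s

    # Sparse model: 100B total, about 10B active per token, 4-bit weights.
    sparse = decode_tps_upper_bound(10, 4, bandwidth)

    # Dense model of the same total size: all 100B read per token.
    dense = decode_tps_upper_bound(100, 4, bandwidth)

    print(f"sparse ~10B active: at most {sparse:.0f} tokens/s")
    print(f"dense 100B:         at most {dense:.0f} tokens/s")
```

With these assumed numbers the sparse configuration tops out near 50 tokens per second and the dense one near 5, which is one way to see why the thread asks for roughly 10B active parameters instead of a dense model that merely fits.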
Source: r/LocalLLaMA.