DeepSeek V4 Lands on Hugging Face and LocalLLaMA Immediately Starts Doing the RAM Math
Original: Deepseek V4 Flash and Non-Flash Out on HuggingFace
The first reaction was not applause; it was hardware arithmetic
r/LocalLLaMA reacted to DeepSeek V4 in a very recognizable way: the moment the collection showed up on Hugging Face, users started translating the release into RAM, VRAM, and deployment pain. The top comment was pure self-hosting regret about not overbuilding memory when assembling a machine. Another joked that what the release really needs is a 0.01-bit quant. That tone matters. In open-model communities, a launch is not judged only by benchmark tables. It is judged by whether people can imagine living with it.
According to DeepSeek’s model card, this preview series has two branches. DeepSeek-V4-Pro uses 1.6T total parameters with 49B activated, while DeepSeek-V4-Flash uses 284B total parameters with 13B activated. Both support a 1M-token context window. The technical upgrades cited include a hybrid attention design combining CSA and HCA, mHC for stronger residual-style signal propagation, and the Muon optimizer. DeepSeek also claims that in the 1M-context setting, V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache required by DeepSeek-V3.2. The pretraining corpus is listed at more than 32T tokens.
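To see why the thread's first instinct was memory arithmetic, a rough back-of-the-envelope sketch is enough. The Python snippet below uses the round parameter counts from the model card quoted above and assumes weights dominate memory: for an MoE model the full expert set has to be resident even though only a fraction of parameters is activated per token, and the estimate deliberately ignores KV cache, activations, and runtime overhead. The byte-per-parameter figures are generic quantization assumptions, not anything DeepSeek publishes.

```python
# Back-of-the-envelope weight-memory estimate for the two DeepSeek V4 branches.
# Assumption: total parameters (not just activated ones) must sit in memory for
# MoE inference; KV cache and runtime overhead are ignored entirely.

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,  # full-precision weights
    "int8": 1.0,       # 8-bit quantization
    "4-bit": 0.5,      # typical GGUF/AWQ-style 4-bit quantization
}

MODELS = {
    "DeepSeek-V4-Pro (1.6T total / 49B activated)": 1.6e12,
    "DeepSeek-V4-Flash (284B total / 13B activated)": 284e9,
}

for name, total_params in MODELS.items():
    print(name)
    for fmt, bytes_per_param in BYTES_PER_PARAM.items():
        gib = total_params * bytes_per_param / 2**30
        print(f"  {fmt:>9}: ~{gib:,.0f} GiB for weights alone")
```

Even under these generous assumptions, the Flash branch needs roughly 130 GiB at 4-bit and the Pro branch close to 750 GiB, which is the arithmetic behind the top comment's regret about not overbuilding memory.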
What the subreddit locked onto
The comment thread quickly split into two useful tracks. One side focused on accessibility and licensing, especially the fact that the release ships under an MIT license. The other side drilled into capability. Readers pulled benchmark lines from the model card and noted that V4-Pro Max is positioned strongly on coding and agentic work, including reported scores such as LiveCodeBench 93.5, Terminal Bench 2.0 67.9, SWE Verified 80.6, and MCPAtlas 73.6. But even those impressed by the numbers kept circling back to deployment reality. One of the most upvoted reactions asked, in effect, how wealthy someone would need to be to run Flash locally at all.
Why it matters
This is what maturity looks like in open-weight communities. A model is no longer evaluated only on leaderboard prestige. Users also care about activated parameter count, context budget, licensing, KV cache behavior, and how plausible self-hosting is outside a data-center budget. The DeepSeek V4 thread captured that shift perfectly. People were excited, but they were excited in operational terms. The lesson of the release is that frontier capability is not enough by itself. If open models want to matter beyond screenshots, they have to meet the community at the point where benchmark ambition collides with memory topology and real deployment cost.
Related Articles
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
A March 26, 2026 r/LocalLLaMA post linking NVIDIA's `gpt-oss-puzzle-88B` model card reached 284 points and 105 comments at crawl time. NVIDIA says the 88B MoE model uses its Puzzle post-training NAS pipeline to cut parameters and KV-cache costs while keeping reasoning accuracy near or above the parent model.
A popular r/LocalLLaMA thread argues that MiniMax M2.7 should be treated as an open-weights release with a restricted license, not as open source, because commercial use requires prior written authorization.