LocalLLaMA did not just celebrate the DeepSeek V4 release. The thread instantly turned into a collective calculation about the 1M-token context, activated parameter counts, and what this actually means for real hardware, with praise for the MIT license mixed in.
#moe
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
LocalLLaMA upvoted this because it felt like real plumbing, not another benchmark screenshot. The excitement was about DeepSeek open-sourcing faster expert-parallel communication and reusable GPU kernels.
Why it matters: Alibaba is putting a small-active-parameter multimodal coding model into open weights rather than keeping it API-only. The tweet says Qwen3.6-35B-A3B has 35B total parameters, 3B active parameters, and an Apache 2.0 license; the blog reports 73.4 on SWE-bench Verified and 51.5 on Terminal-Bench 2.0.
HN latched onto the open-weight angle: a 35B MoE model with only 3B active parameters is interesting if it can actually carry coding-agent work. Qwen says Qwen3.6-35B-A3B improves sharply over Qwen3.5-35B-A3B, while commenters immediately moved to GGUF builds, Mac memory limits, and whether benchmark tables that compare only against other open models tell the whole story.
LocalLLaMA reacted because the post targets a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22 GB VRAM budget.
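The general idea is an LRU-style cache over experts: routing decisions repeat often enough that keeping the most recently used experts resident in spare VRAM saves most of the offload traffic. Below is a minimal Python sketch of that caching pattern, not the fork's actual code; `move_to_vram`, `move_to_cpu`, and the expert budget are invented placeholders.

```python
# Sketch: keep an LRU set of "hot" experts resident in VRAM, demote cold ones.
# Hypothetical illustration only; not the llama.cpp fork's implementation.
from collections import OrderedDict

VRAM_EXPERT_BUDGET = 16  # hypothetical: how many experts fit in spare VRAM


class ExpertCache:
    """Track recently routed experts and pin the hottest ones in VRAM."""

    def __init__(self, budget: int = VRAM_EXPERT_BUDGET):
        self.budget = budget
        self.resident: OrderedDict[int, str] = OrderedDict()  # expert_id -> location

    def on_route(self, expert_id: int) -> str:
        if expert_id in self.resident:
            # Hit: refresh recency so this expert stays hot.
            self.resident.move_to_end(expert_id)
            return "vram_hit"
        # Miss: upload this expert's weights, evicting the least recently
        # routed expert if the VRAM budget is full.
        if len(self.resident) >= self.budget:
            evicted_id, _ = self.resident.popitem(last=False)
            # move_to_cpu(evicted_id)   # hypothetical weight transfer
        # move_to_vram(expert_id)       # hypothetical weight transfer
        self.resident[expert_id] = "vram"
        return "uploaded"


# Toy usage: router choices tend to repeat, so hits dominate after warm-up.
cache = ExpertCache(budget=4)
for expert_id in [3, 7, 3, 3, 7, 11, 3, 7, 2, 3]:
    print(expert_id, cache.on_route(expert_id))
```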
On April 6, 2026, Cursor said on X that it rebuilt how MoE models generate tokens on NVIDIA Blackwell GPUs. In a companion engineering post, the company said its "warp decode" approach improves throughput by 1.84x while producing outputs 1.4x closer to an FP32 reference.
A March 26, 2026 r/LocalLLaMA post linking NVIDIA's `gpt-oss-puzzle-88B` model card reached 284 points and 105 comments at crawl time. NVIDIA says the 88B MoE model uses its Puzzle post-training NAS pipeline to cut parameters and KV-cache costs while keeping reasoning accuracy near or above the parent model.
Flash-MoE is a C and Metal inference engine that claims to run Qwen3.5-397B-A17B on a 48 GB MacBook Pro. The key idea is to keep a 209 GB MoE model on SSD and stream only the active experts needed for each token.
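Flash-MoE is written in C and Metal, but the streaming trick is easy to sketch in Python: memory-map the expert weights on SSD and copy into RAM only the slices belonging to the experts the router picked for the current token. The file layout, expert count, and shapes below are assumptions scaled down for a runnable example; they are not Flash-MoE's real format.

```python
# Sketch of per-token expert streaming from SSD via a memory map.
# Sizes and file layout are made up for illustration, not Flash-MoE's format.
import numpy as np

NUM_EXPERTS = 8            # hypothetical experts per MoE layer (tiny for the demo)
EXPERT_PARAMS = 1024 * 1024  # hypothetical parameters per expert

# Create a small fake weights file so the example is self-contained.
weights_path = "experts.bin"
np.zeros((NUM_EXPERTS, EXPERT_PARAMS), dtype=np.float16).tofile(weights_path)

# Map the whole file without loading it; pages are pulled from SSD on demand.
experts = np.memmap(weights_path, dtype=np.float16,
                    mode="r", shape=(NUM_EXPERTS, EXPERT_PARAMS))


def load_active_experts(routed_ids):
    """Copy only the routed experts' weights into RAM for this token."""
    return {eid: np.array(experts[eid]) for eid in routed_ids}


# With an A17B-style model only a handful of experts fire per token, so each
# decode step touches a small fraction of the full on-disk model.
active = load_active_experts([1, 4, 6])
print({eid: f"{w.nbytes // 2**20} MiB" for eid, w in active.items()})
```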
A March 16, 2026 r/LocalLLaMA post about Mistral Small 4 reached 606 points and 232 comments in the latest available crawl. Mistral’s model card describes a 119B-parameter MoE with 4 active experts, 256k context, multimodal input, and a per-request switch between standard and reasoning modes.
Sebastian Raschka's LLM Architecture Gallery drew attention on HN for turning recent model families into comparable diagrams, making dense, MoE, and hybrid design choices easier to scan in one place.
The r/LocalLLaMA community is buzzing over Qwen3.5-35B-A3B, which users report outperforms GPT-OSS-120B at roughly one-third the size, and which many are calling an excellent local daily driver for development tasks.