#moe

LLM Reddit 16h ago 2 min read

DeepSeek V4 Lands on Hugging Face and LocalLLaMA Immediately Starts Doing the RAM Math

LocalLLaMA did not just celebrate the DeepSeek V4 release. The thread instantly turned into a collective calculation about 1M context, activated parameters, and what the release actually means for real hardware, with praise for the MIT license mixed in.

#deepseek-v4 #open-weights #moe

LLM Hacker News 2d ago 2 min read

HN Spots the Real DeepSeek V4 Story: The Docs Link Was Thin, but the Weights Were Already Live

HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.

#deepseek #llm #moe

LLM Reddit 3d ago 2 min read

Why LocalLLaMA treated DeepEP V2 and TileKernels as more than just another infra drop

LocalLLaMA upvoted this because it felt like real plumbing, not another benchmark screenshot. The excitement was about DeepSeek open-sourcing faster expert-parallel communication and reusable GPU kernels.

#deepseek #deepep #tilekernels

AI X (Twitter) Apr 17, 2026 2 min read

Qwen3.6-35B-A3B opens 35B MoE weights with 3B active parameters

Why it matters: Alibaba is putting a small-active-parameter multimodal coding model into open weights rather than keeping it API-only. The tweet says Qwen3.6-35B-A3B has 35B total parameters, 3B active parameters, and an Apache 2.0 license; the blog reports 73.4 on SWE-bench Verified and 51.5 on Terminal-Bench 2.0.

#qwen #open-weights #moe

LLM Hacker News Apr 16, 2026 1 min read

HN Sees Qwen3.6-35B-A3B as a Small Active-Parameter Bet for Coding Agents

HN latched onto the open-weight angle: a 35B MoE model with only 3B active parameters is interesting if it can actually carry coding-agent work. Qwen says Qwen3.6-35B-A3B improves sharply over Qwen3.5-35B-A3B, while commenters immediately moved to GGUF builds, Mac memory limits, and whether open-model-only benchmark tables are enough context.

#qwen #open-weights #coding-agents

LLM Reddit Apr 16, 2026 2 min read

LocalLLaMA Finds a Practical Speed Trick in Caching Hot MoE Experts in VRAM

LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.

#local-llm #llama-cpp #moe
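
To make the "keep the hot experts in VRAM" idea concrete, here is a minimal sketch of an expert cache with least-recently-used eviction. It is an illustration only, not the fork's actual policy or code: the class name `HotExpertCache`, the `capacity` parameter (how many experts fit in the VRAM budget), and the choice of plain LRU eviction are all assumptions made for this example.

```python
from collections import OrderedDict


class HotExpertCache:
    """Toy LRU cache of expert IDs standing in for a fixed VRAM budget.

    Hypothetical illustration: the llama.cpp fork described above may use
    a different tracking and eviction policy.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity        # assumed: number of experts that fit in VRAM
        self.resident = OrderedDict()   # expert_id -> True, least recently used first
        self.hits = 0
        self.misses = 0

    def route(self, expert_id):
        """Record that the router picked expert_id; return True on a VRAM hit."""
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return True
        self.misses += 1
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict the least recently used expert
        self.resident[expert_id] = True           # "upload" the expert into VRAM
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = HotExpertCache(capacity=2)
    for expert in [1, 2, 1, 3, 2]:
        cache.route(expert)
    print(list(cache.resident), cache.hits, cache.misses)
```

The higher the hit rate on a real routing trace, the fewer expert weights have to cross the PCIe bus per token, which is the effect the post is measuring.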

LLM X (Twitter) Apr 8, 2026 2 min read

Cursor details warp decode for Blackwell GPUs, claiming 1.84x faster MoE inference

On April 6, 2026, Cursor said on X that it rebuilt how MoE models generate tokens on NVIDIA Blackwell GPUs. In a companion engineering post, the company said its "warp decode" approach improves throughput by 1.84x while producing outputs 1.4x closer to an FP32 reference.

#cursor #moe #inference

LLM Reddit Mar 28, 2026 2 min read

LocalLLaMA Tracks NVIDIA's gpt-oss-puzzle-88B as Puzzle Shrinks gpt-oss-120b for Cheaper Serving

A March 26, 2026 r/LocalLLaMA post linking NVIDIA's `gpt-oss-puzzle-88B` model card reached 284 points and 105 comments at crawl time. NVIDIA says the 88B MoE model uses its Puzzle post-training NAS pipeline to cut parameters and KV-cache costs while keeping reasoning accuracy near or above the parent model.

#nvidia #gpt-oss #open-weights

LLM Hacker News Mar 22, 2026 2 min read

Flash-MoE: Running a 397B Parameter Model on a Laptop

Flash-MoE is a C and Metal inference engine that claims to run Qwen3.5-397B-A17B on a 48 GB MacBook Pro. The key idea is to keep a 209 GB MoE model on SSD and stream only the active experts needed for each token.

#llm #moe #metal
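
The figures in the Flash-MoE teaser allow a quick back-of-the-envelope check. The sketch below assumes that "209 GB" means decimal gigabytes, that the "A17B" suffix denotes roughly 17B active parameters per token (following the naming used for the other Qwen models on this page), and that quantization is uniform across the model. None of these are confirmed by the teaser, and the function names are hypothetical.

```python
def avg_bits_per_param(model_bytes, total_params):
    """Average storage cost per parameter, in bits."""
    return model_bytes * 8 / total_params


def active_fraction(active_params, total_params):
    """Share of the parameters that take part in a single token."""
    return active_params / total_params


if __name__ == "__main__":
    total_params = 397e9    # from the model name Qwen3.5-397B-A17B
    active_params = 17e9    # assumed from the "A17B" suffix
    model_bytes = 209e9     # assumed: 209 GB in decimal gigabytes

    bits = avg_bits_per_param(model_bytes, total_params)
    share = active_fraction(active_params, total_params)
    print(f"average bits per parameter: {bits:.2f}")
    print(f"active share per token: {share:.1%}")
```

Under these assumptions the model averages a little over 4 bits per parameter, and only a few percent of the weights are needed for any one token, which is why streaming just the active experts from SSD is plausible at all.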

LLM Reddit Mar 19, 2026 2 min read

LocalLLaMA Tracks Mistral Small 4 as Mistral Collapses Instruct, Reasoning, and Devstral Into One MoE

A March 16, 2026 r/LocalLLaMA post about Mistral Small 4 reached 606 points and 232 comments in the latest available crawl. Mistral’s model card describes a 119B-parameter MoE with 4 active experts, 256k context, multimodal input, and a per-request switch between standard and reasoning modes.

#mistral #multimodal #reasoning

LLM Hacker News Mar 16, 2026 2 min read

Hacker News Surfaces a Visual Reference for Modern LLM Architectures

Sebastian Raschka's LLM Architecture Gallery drew attention on HN for turning recent model families into comparable diagrams, making dense, MoE, and hybrid design choices easier to scan in one place.

#llm-architectures #transformers #moe

LLM Reddit Mar 1, 2026 1 min read

Qwen 3.5-35B-A3B Surpasses GPT-OSS-120B as Daily Driver at 1/3 the Size

r/LocalLLaMA users report that Qwen 3.5-35B-A3B outperforms GPT-OSS-120B at roughly one-third the size, which makes it a strong candidate for a local daily driver for development tasks.

#qwen #local-llm #open-source