Why LocalLLaMA treated DeepEP V2 and TileKernels as more than just another infra drop
Original post: DeepSeek has released DeepEP V2 and TileKernels.
LocalLLaMA liked the plumbing story
The LocalLLaMA thread around DeepEP V2 and TileKernels had a specific kind of excitement: this was not another pretty benchmark screenshot. It was infra work. People upvoted it because faster expert-parallel communication and better kernels directly change what open MoE systems can train and serve, and because DeepSeek keeps publishing pieces of that stack instead of treating them as untouchable internal sauce.
The DeepEP V2 release notes describe a full refactor of expert parallelism. The new version unifies the high-throughput and low-latency APIs, switches from NVSHMEM to a lighter NCCL Gin backend, and supports much larger scale-up and scale-out domains, up to EP2048. DeepSeek also says V2 can hit up to 1.3x the peak performance of V1 while using up to 4x fewer SMs, alongside experimental 0-SM Engram, pipeline parallel, and context parallel all-gather features.
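To make the dispatch/combine pattern concrete, here is a minimal pure-Python sketch of the communication shape that expert-parallel libraries like DeepEP optimize. This is not DeepEP's API; the rank count, expert sharding, and function names are assumptions for illustration. Each EP rank hosts a shard of experts, tokens are routed to the rank owning their assigned expert (dispatch), processed there, and the results travel back (combine) — an all-to-all in each direction.

```python
# Illustrative sketch (NOT DeepEP's API): the dispatch/combine pattern
# behind expert parallelism. Constants below are assumptions.

NUM_RANKS = 4          # EP group size (assumed for the example)
EXPERTS_PER_RANK = 2   # experts sharded evenly across ranks

def owner_rank(expert_id: int) -> int:
    """Map an expert id to the EP rank that hosts it."""
    return expert_id // EXPERTS_PER_RANK

def dispatch(tokens):
    """Group (token_id, expert_id) pairs into per-rank send buffers."""
    send = {r: [] for r in range(NUM_RANKS)}
    for tok, expert in tokens:
        send[owner_rank(expert)].append((tok, expert))
    return send

def combine(send):
    """Simulate expert compute on each rank, then gather results back."""
    out = []
    for rank, items in send.items():
        for tok, expert in items:
            out.append((tok, f"expert_{expert}@rank{rank}"))
    return sorted(out)  # restore original token order

routed = dispatch([(0, 5), (1, 0), (2, 7), (3, 2)])
print(combine(routed))
```

In a real system both directions are network all-to-alls whose latency and SM cost are exactly what the V2 refactor targets; the sketch only shows the routing logic those collectives implement.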
TileKernels fills in the other half of the story. The new library, built on TileLang, bundles optimized GPU kernels for MoE gating and routing, quantization, transpose ops, engram gating, manifold hyperconnection, and higher-level torch autograd wrappers. In short, DeepSeek is not only improving the communication layer but also opening a reusable kernel toolbox for the kinds of operations that dominate LLM infrastructure work.
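For readers unfamiliar with what "MoE gating and routing" computes, here is a small sketch of top-k softmax gating in plain Python. TileKernels ships fused GPU kernels for this class of operation; the code below is only the reference math, and the function name, renormalization choice, and example logits are assumptions, not TileKernels' implementation.

```python
import math

def topk_gate(logits, k=2):
    """Softmax over expert logits, keep the top-k, renormalize their weights.

    Sketch of the math a fused MoE gating kernel implements; assumed
    details, not TileKernels' actual code.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]      # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in topk)
    return [(i, probs[i] / norm) for i in topk]   # (expert_id, weight) pairs

# One token's router logits over 4 experts:
print(topk_gate([2.0, 0.5, 1.0, -1.0], k=2))
```

On a GPU this gets fused with sorting tokens by destination expert and building the dispatch indices, which is why a dedicated kernel library pays off.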
- MoE performance is increasingly about routing and communication, not just weights.
- Lower SM usage means more room to balance system resources under real workloads.
- Open infra code compounds because other teams can test, adapt, and build on it immediately.
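The first bullet can be backed with back-of-envelope arithmetic: the bytes each token pushes over the network per MoE layer scale with top-k and hidden size, independent of weight count. The hidden size, top-k, and FP8 activation dtype below are illustrative assumptions, not DeepSeek's actual configuration.

```python
# Rough estimate of per-token all-to-all traffic for one MoE layer.
# All constants are assumptions for illustration.

HIDDEN = 7168        # hidden dimension (assumed)
TOP_K = 8            # experts activated per token (assumed)
BYTES_PER_ELEM = 1   # FP8 activations (assumed)

def bytes_per_token_per_layer():
    # Each token's activation goes out to TOP_K experts (dispatch) and the
    # TOP_K expert outputs come back (combine): 2 directions.
    return 2 * TOP_K * HIDDEN * BYTES_PER_ELEM

print(bytes_per_token_per_layer())  # 114688 bytes = 112 KiB per token per layer
```

Multiply by dozens of MoE layers and thousands of tokens per second per GPU and the communication fabric, not the matmuls, becomes the constraint — which is why shaving SMs off the communication path matters.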
The top Reddit comments captured that mood well. People praised DeepSeek for acting like a research lab that still ships its systems work to the public. That goodwill is not just ideological. For the open-model community, releases like DeepEP V2 and TileKernels are leverage. They make the hard, unglamorous parts of MoE systems a little less mysterious and a little more portable.
Related Articles
A March 26, 2026 r/LocalLLaMA post linking NVIDIA's `gpt-oss-puzzle-88B` model card reached 284 points and 105 comments at crawl time. NVIDIA says the 88B MoE model uses its Puzzle post-training NAS pipeline to cut parameters and KV-cache costs while keeping reasoning accuracy near or above the parent model.
On April 6, 2026, Cursor said on X that it rebuilt how MoE models generate tokens on NVIDIA Blackwell GPUs. In a companion engineering post, the company said its "warp decode" approach improves throughput by 1.84x while producing outputs 1.4x closer to an FP32 reference.
IBM Research’s VAKRA moves agent evaluation from static Q&A into executable tool environments. With 8,000+ locally hosted APIs across 62 domains and 3-7 step reasoning chains, the benchmark finds a gap between surface tool use and reliable enterprise agents.