NVIDIA’s Nemotron-TwoTower tests diffusion-style generation for LLMs
NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16 on Hugging Face, and the LocalLLaMA thread picked up on it because the decoding approach is unusual. Rather than generating strictly one token at a time, the model uses a block-wise autoregressive diffusion setup built on the Nemotron 3 Nano 30B-A3B backbone.
The architecture is split into two towers. The AR/context tower processes the prompt and the already-committed tokens, producing the attention KV cache and Mamba states. The diffusion/denoiser tower works on the current noisy block, using bidirectional attention inside the block and layer-aligned cross-attention into the context tower. The denoiser predicts multiple masked positions at once, commits the high-confidence tokens, and repeats until the block is resolved.
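The commit-and-repeat loop described above can be sketched in a few lines of Python. This is a toy illustration, not NVIDIA's implementation: `decode_block`, `generate`, `toy_denoiser`, and the default `threshold` of 0.9 are hypothetical names and values, and the denoiser here returns deterministic pseudo-random confidences instead of running attention over a real context cache. The fallback that commits the single most confident position when nothing clears the threshold is also an assumption, added so that the loop always terminates.

```python
import random
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a block position that has not been committed yet

# Hypothetical denoiser interface: given the committed context and the current
# partially masked block, return one (token, confidence) guess per position.
Denoiser = Callable[[List[str], List[Optional[str]]], List[Tuple[str, float]]]


def decode_block(context: List[str], denoise: Denoiser,
                 block_size: int, threshold: float = 0.9):
    """Resolve one block by repeatedly committing confident positions."""
    block: List[Optional[str]] = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        guesses = denoise(context, block)
        steps += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        confident = [i for i in masked if guesses[i][1] >= threshold]
        if not confident:
            # Assumed fallback: commit the single best guess to guarantee progress.
            confident = [max(masked, key=lambda i: guesses[i][1])]
        for i in confident:
            block[i] = guesses[i][0]
    return block, steps


def generate(prompt: List[str], denoise: Denoiser, num_blocks: int,
             block_size: int, threshold: float = 0.9):
    """Block-wise loop: each resolved block is appended to the committed context."""
    context = list(prompt)
    total_steps = 0
    for _ in range(num_blocks):
        block, steps = decode_block(context, denoise, block_size, threshold)
        context.extend(block)
        total_steps += steps
    return context, total_steps


def toy_denoiser(context, block):
    """Stand-in for the denoiser tower: fixed tokens, pseudo-random confidences."""
    committed = sum(tok is not MASK for tok in block)
    rng = random.Random(len(context) * 1000 + committed)
    return [(f"t{len(context) + i}", rng.random()) for i in range(len(block))]


if __name__ == "__main__":
    for thr in (0.0, 0.9, 1.1):
        tokens, steps = generate(["a", "b"], toy_denoiser, 2, 4, threshold=thr)
        print(f"threshold={thr}: {steps} denoising steps for {len(tokens) - 2} tokens")
```

In this toy setup, a threshold of 0.0 commits every position on the first pass, so each block costs one denoising step, while a threshold above 1.0 never clears and the loop degrades to one token per step. The real model sits between those extremes, which is where the speed-versus-quality trade-off comes from.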
NVIDIA’s headline numbers explain the interest. At the default operating point, NVIDIA reports that the model retains 98.7% of the autoregressive baseline’s aggregate benchmark quality while reaching 2.42 times the baseline’s wall-clock generation throughput. Lowering the confidence threshold commits more tokens per step, which increases speed at some cost in quality.
This is not just another open checkpoint for local users to try. It is a concrete test of whether diffusion-style text generation can become a practical inference path for LLMs. The remaining questions are serving complexity, hardware requirements, conversational quality, and how the model behaves outside benchmark-style prompts.
For the local LLM community, the release broadens the speed conversation. Speculative decoding is no longer the only obvious route to faster generation; the decoding architecture itself is becoming an experimental surface.