NVIDIA TwoTower keeps 98.7% quality while generating 2.42x faster
A diffusion route to faster decoding
NVIDIA Research is pushing on a core bottleneck in language models: generating one token at a time. Its new Nemotron-Labs-TwoTower model adapts a 30B-class Nemotron backbone into a two-tower diffusion language model, aiming to keep most of the original quality while committing multiple tokens in parallel.
“We found it kept 98.7% of the original model’s quality at 2.42× faster generation.”
The source tweet was posted by NVIDIA AI on July 1, 2026 at 19:00:01 UTC, inside the 48-hour cutoff. NVIDIA AI usually posts research, developer tooling, and AI infrastructure updates. This item is material because it gives both a model architecture and a measured tradeoff, not only a release link. A follow-up pointed to the Hugging Face checkpoint for Nemotron-Labs-TwoTower-30B-A3B-Base-BF16.
The model card provides the technical context. Nemotron-Labs-TwoTower is described as a block-wise autoregressive diffusion model built on the NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 backbone. One tower processes the clean prompt and previous tokens as the context tower; the other works as a denoiser that fills token blocks. NVIDIA says the default configuration uses confidence unmasking with block size 16 on 2 H100 GPUs, retaining 98.7% of the autoregressive baseline’s aggregate benchmark quality while producing 2.42x wall-clock generation throughput.
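To make the block-wise idea concrete, here is a minimal toy sketch of confidence-based unmasking. It is not NVIDIA's implementation: the `toy_denoiser` function, the 0.9 threshold, and the "always commit at least the most confident position" rule are all assumptions made for illustration, and the denoiser returns random guesses rather than model predictions. Only the block size of 16 comes from the model card.

```python
import random

MASK = None  # marks a position in the block that has not been committed yet


def toy_denoiser(context, block, rng):
    """Hypothetical stand-in for the denoiser tower.

    Returns {position: (token_id, confidence)} for every masked position.
    A real denoiser would condition on `context` and the partly filled block.
    """
    return {
        i: (rng.randrange(100), rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def fill_block(context, block_size, threshold, rng):
    """Fill one block, committing several tokens per denoising step."""
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        guesses = toy_denoiser(context, block, rng)
        # Commit every masked position whose confidence clears the threshold.
        accepted = [i for i, (_, conf) in guesses.items() if conf >= threshold]
        if not accepted:
            # Assumed fallback: commit the single most confident position
            # so that every step makes progress.
            accepted = [max(guesses, key=lambda i: guesses[i][1])]
        for i in accepted:
            block[i] = guesses[i][0]
        steps += 1
    return block, steps


def generate(prompt, num_blocks, block_size=16, threshold=0.9, seed=0):
    """Autoregressive over blocks, parallel within each block."""
    rng = random.Random(seed)
    context = list(prompt)  # plays the role of the context tower's clean input
    total_steps = 0
    for _ in range(num_blocks):
        block, steps = fill_block(context, block_size, threshold, rng)
        context.extend(block)  # committed block becomes clean context
        total_steps += steps
    return context, total_steps


tokens, steps = generate(prompt=[1, 2, 3], num_blocks=2)
print(f"{len(tokens) - 3} new tokens in {steps} denoising steps")
```

Any speedup in a scheme like this comes from needing fewer denoising steps than tokens: a one-token-at-a-time decoder would always take 32 steps for these two blocks, while the loop above takes at most 32 and fewer whenever several positions clear the threshold in the same step.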
The concrete tradeoff is visible in benchmark examples. The model card lists MMLU at 78.24 versus 78.56 for the baseline, HumanEval at 75.58 versus 79.27, and MATH-500 at 80.60 versus 84.40. That is not a free speedup: MMLU is nearly flat, but HumanEval and MATH-500 each drop by just under four points. For workloads where latency and throughput dominate, though, keeping nearly all aggregate quality while more than doubling generation speed is a serious systems result.
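The per-benchmark tradeoff can be checked with a few lines of arithmetic. The sketch below uses only the three score pairs quoted above; the variable names are illustrative.

```python
# (TwoTower score, autoregressive baseline score), as quoted from the model card
scores = {
    "MMLU": (78.24, 78.56),
    "HumanEval": (75.58, 79.27),
    "MATH-500": (80.60, 84.40),
}

# Fraction of the baseline score that the diffusion model retains
retention = {name: tt / base for name, (tt, base) in scores.items()}

for name, (tt, base) in scores.items():
    print(f"{name}: {retention[name]:.1%} retained ({tt - base:+.2f} points)")
```

On these three benchmarks retention works out to roughly 99.6%, 95.3%, and 95.5%. Two of the three sit below the 98.7% headline number, so the aggregate presumably covers a wider benchmark set than the examples quoted here.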
Next, watch whether two-tower diffusion decoding holds up outside curated benchmarks: long-context generation, tool calls, code editing, multilingual tasks, and safety filters can stress decoding methods differently. The other question is deployment cost, because the released checkpoint contains both towers and the default run uses 2 H100 GPUs. Faster tokens matter most when the total serving economics still improve.
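A back-of-the-envelope way to frame the serving question is GPU time per generated token relative to the baseline. The sketch below is an assumption-heavy illustration: the source does not say what hardware the autoregressive baseline ran on, so `gpus_base` is a hypothetical input, and the formula assumes identical GPUs, equal hourly cost, and that the 2.42x figure is total throughput for the whole deployment.

```python
def relative_cost_per_token(gpus_new, gpus_base, speedup):
    """GPU-seconds per token for the new setup divided by the baseline's.

    Values below 1.0 mean the new setup is cheaper per token.
    """
    return (gpus_new / gpus_base) / speedup


# The two-tower run uses 2 H100 GPUs at 2.42x throughput (from the model card).
# The baseline GPU counts here are hypothetical scenarios, not reported figures.
same_hardware = relative_cost_per_token(gpus_new=2, gpus_base=2, speedup=2.42)
half_hardware = relative_cost_per_token(gpus_new=2, gpus_base=1, speedup=2.42)

print(f"baseline also on 2 GPUs: {same_hardware:.2f}x cost per token")
print(f"baseline on 1 GPU:       {half_hardware:.2f}x cost per token")
```

Under these assumptions the per-token cost falls to about 0.41x if the baseline also needs two GPUs, but only to about 0.83x if the baseline could be served from one, which is why the hardware footprint matters as much as the speedup.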