NVIDIA TwoTower keeps 98.7% quality while generating 2.42x faster

A diffusion route to faster decoding

NVIDIA Research is pushing on a core bottleneck in language models: generating one token at a time. Its new Nemotron-Labs-TwoTower model adapts a 30B-class Nemotron backbone into a two-tower diffusion language model, aiming to keep most of the original quality while committing multiple tokens in parallel.

“We found it kept 98.7% of the original model’s quality at 2.42× faster generation.”

The source tweet was posted by NVIDIA AI on July 1, 2026 at 19:00:01 UTC. NVIDIA AI usually posts research, developer tooling, and AI infrastructure updates. This item is noteworthy because it gives both a model architecture and a measured tradeoff, not only a release link. A follow-up pointed to the Hugging Face checkpoint for Nemotron-Labs-TwoTower-30B-A3B-Base-BF16.

The model card provides the technical context. Nemotron-Labs-TwoTower is described as a block-wise autoregressive diffusion model built on the NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 backbone. One tower, the context tower, processes the clean prompt and previously generated tokens; the other works as a denoiser that fills in token blocks. NVIDIA says the default configuration uses confidence unmasking with block size 16 on two H100 GPUs, retaining 98.7% of the autoregressive baseline's aggregate benchmark quality while delivering 2.42x the wall-clock generation throughput.
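To make the block-wise idea concrete, here is a minimal sketch of confidence-based unmasking over blocks of 16 positions. It is not NVIDIA's implementation: `toy_denoiser`, the `threshold` value, and the fallback rule that commits the single most confident position are all assumptions made for illustration, and the "model" simply returns random confidences.

```python
import random

MASK = None  # marks a block position that has not been committed yet


def toy_denoiser(context, block, rng):
    """Hypothetical stand-in for the denoiser tower.

    Returns {position: (token, confidence)} for every masked position.
    A real denoiser would condition on the context tower's output.
    """
    preds = {}
    for i, tok in enumerate(block):
        if tok is MASK:
            preds[i] = (f"tok{len(context) + i}", rng.random())
    return preds


def decode_block(context, block_size=16, threshold=0.6, seed=0):
    """Fill one block, committing several positions per denoising step."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    steps = 0
    while any(tok is MASK for tok in block):
        preds = toy_denoiser(context, block, rng)
        # Commit every position whose confidence clears the threshold.
        chosen = [i for i, (_, conf) in preds.items() if conf >= threshold]
        if not chosen:
            # Assumed fallback: commit the most confident position so the
            # loop always makes progress.
            chosen = [max(preds, key=lambda i: preds[i][1])]
        for i in chosen:
            block[i] = preds[i][0]
        steps += 1
    return block, steps


def generate(prompt, num_blocks=2, block_size=16):
    """Blocks are produced left to right; each finished block joins the context."""
    context = list(prompt)
    total_steps = 0
    for b in range(num_blocks):
        block, steps = decode_block(context, block_size, seed=b)
        context.extend(block)
        total_steps += steps
    return context, total_steps


if __name__ == "__main__":
    tokens, steps = generate(["<s>", "hello"])
    print(f"{len(tokens) - 2} tokens generated in {steps} denoising steps")
```

The point of the sketch is the step count: one-token-at-a-time decoding needs one step per token, while a block decoder finishes in fewer steps whenever more than one position clears the threshold in a pass.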

The concrete tradeoff is visible in benchmark examples. The model card lists MMLU at 78.24 versus 78.56 for the baseline, HumanEval at 75.58 versus 79.27, and MATH-500 at 80.60 versus 84.40. That is not a free speedup; some tasks lose accuracy. But for workloads where latency and throughput dominate, keeping nearly all aggregate quality while more than doubling generation speed is a serious systems result.
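The per-benchmark cost can be worked out directly from the three score pairs quoted above. The short script below computes the share of the baseline score retained and the absolute gap for each; the dictionary layout and the `retention` helper are illustrative choices, not something taken from the model card. These three examples alone need not reproduce the 98.7% figure, which is an aggregate over the model card's benchmark set.

```python
# (TwoTower score, autoregressive baseline score) as quoted in the article
scores = {
    "MMLU": (78.24, 78.56),
    "HumanEval": (75.58, 79.27),
    "MATH-500": (80.60, 84.40),
}


def retention(two_tower, baseline):
    """Percentage of the baseline score that the diffusion model keeps."""
    return 100.0 * two_tower / baseline


if __name__ == "__main__":
    for name, (two_tower, baseline) in scores.items():
        kept = retention(two_tower, baseline)
        gap = two_tower - baseline
        print(f"{name}: {kept:.1f}% retained ({gap:+.2f} points)")
```

MMLU is nearly unchanged at about 99.6% retained, while HumanEval and MATH-500 each keep roughly 95% of the baseline score, which is where the accuracy cost concentrates.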

Next, watch whether two-tower diffusion decoding holds up outside curated benchmarks: long-context generation, tool calls, code editing, multilingual tasks, and safety filters can stress decoding methods differently. The other question is deployment cost, because the released checkpoint contains both towers and the default run uses 2 H100 GPUs. Faster tokens matter most when the total serving economics still improve.
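The serving-economics question comes down to simple arithmetic: a throughput gain only helps if it outpaces any increase in hardware. The sketch below normalizes the reported 2.42x speedup by GPU count. The article does not say how many GPUs the autoregressive baseline used for the comparison, so both baseline scenarios here are hypothetical, as is the `per_gpu_speedup` helper.

```python
def per_gpu_speedup(speedup, diffusion_gpus, baseline_gpus):
    """Throughput gain per GPU, given the raw speedup and both GPU counts."""
    return speedup * baseline_gpus / diffusion_gpus


if __name__ == "__main__":
    reported_speedup = 2.42  # from the model card, as quoted in the article
    diffusion_gpus = 2       # default configuration: 2 H100 GPUs

    # Hypothetical baseline hardware; the article does not state this.
    for baseline_gpus in (1, 2):
        gain = per_gpu_speedup(reported_speedup, diffusion_gpus, baseline_gpus)
        print(f"baseline on {baseline_gpus} GPU(s): {gain:.2f}x per-GPU throughput")
```

If the baseline ran on the same two GPUs, the full 2.42x carries over per GPU; if it could be served on one, the per-GPU gain shrinks to 1.21x.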
