NVIDIA positions Groq 3 LPX as the low-latency inference rack for Vera Rubin
Original post: 🚀 Announced at #NVIDIAGTC: NVIDIA Groq 3 LPX, a new rack-scale low-latency inference accelerator for the #NVIDIAVeraRubin platform. Co-designed with Vera Rubin NVL72 — LPX accelerates token generation while Vera Rubin NVL72 powers large-scale training and inference. Together, https://t.co/l1tbGiBL2B
What the X post announced
On March 17, 2026, the NVIDIADC account presented NVIDIA Groq 3 LPX as a new rack-scale low-latency inference accelerator for the Vera Rubin platform. The post also made the architectural split explicit: LPX accelerates token generation, while Vera Rubin NVL72 handles large-scale training and inference. That is more than product marketing shorthand. It is a clear statement about how NVIDIA wants next-generation AI factories to divide work.
The key idea is not simply a faster chip. It is a heterogeneous serving architecture in which different phases of model execution are mapped to different hardware. Rubin GPUs remain the broad, high-throughput workhorse, while LPX is optimized for the latency-sensitive part of interactive generation.
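That split can be sketched as a simple routing rule. The backend names below ("rubin", "lpx") and the `route` function are illustrative placeholders, not a real NVIDIA API; this is only a minimal sketch of mapping execution phases to hardware pools.

```python
# Hypothetical sketch of phase-based routing in a heterogeneous serving
# tier. Pool names are placeholders, not a real scheduler API.
from dataclasses import dataclass

@dataclass
class Step:
    request_id: str
    phase: str                    # "prefill" (prompt ingestion) or "decode" (token generation)
    latency_sensitive: bool = True

def route(step: Step) -> str:
    """Map an execution phase to a hardware pool.

    Prefill is throughput-bound, so it stays on the general-purpose GPU
    pool; latency-sensitive decode goes to the low-latency tier.
    """
    if step.phase == "decode" and step.latency_sensitive:
        return "lpx"              # low-latency inference rack
    return "rubin"                # high-throughput GPU pool

assert route(Step("r1", "prefill")) == "rubin"
assert route(Step("r2", "decode")) == "lpx"
```

The point of the sketch is that the routing decision is made per phase, not per request: the same generation job can touch both pools.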
What NVIDIA added
In its March 16 newsroom announcement, NVIDIA says LPX is designed for the low-latency and large-context demands of agentic systems. According to the company, an LPX rack includes 256 LPUs, 128 GB of on-chip SRAM, and 640 TB/s of scale-up bandwidth. NVIDIA also says the LPX and Vera Rubin combination can deliver up to 35x higher inference throughput per megawatt and up to 10x more revenue opportunity for trillion-parameter models, with availability expected in the second half of 2026.
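Assuming, which NVIDIA's announcement does not spell out, that the SRAM and scale-up bandwidth figures are rack-level aggregates, a quick back-of-envelope division gives the per-LPU numbers:

```python
# Back-of-envelope arithmetic on the stated rack-level figures,
# under the (unconfirmed) assumption that they are per-rack aggregates.
lpus_per_rack = 256
sram_gb_per_rack = 128          # on-chip SRAM, rack aggregate
scaleup_tb_s_per_rack = 640     # scale-up bandwidth, rack aggregate

sram_mb_per_lpu = sram_gb_per_rack * 1024 / lpus_per_rack
bw_tb_s_per_lpu = scaleup_tb_s_per_rack / lpus_per_rack

print(f"{sram_mb_per_lpu:.0f} MB SRAM per LPU")   # 512 MB
print(f"{bw_tb_s_per_lpu:.1f} TB/s per LPU")      # 2.5 TB/s
```

Under that reading, each LPU would hold roughly 512 MB of SRAM and see about 2.5 TB/s of scale-up bandwidth, which is consistent with an SRAM-resident, bandwidth-heavy decode design rather than an HBM-capacity play.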
NVIDIA’s technical blog goes further into the system design. It lists 315 PFLOPS of FP8 compute at rack scale and explains that LPX is intended to accelerate latency-sensitive decode work such as FFN and MoE expert execution, while Rubin GPUs continue to handle prefill and decode attention. In other words, NVIDIA is treating the next inference bottleneck as a systems problem, not only a GPU generation problem.
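The per-layer division of labor the blog describes can be illustrated with a toy decode step. This is not NVIDIA's implementation; the backend classes and layer fields below are invented stand-ins meant only to show the shape of the split: attention stays with the KV cache on the GPU pool, while FFN/MoE compute is offloaded to the low-latency tier inside each layer.

```python
# Illustrative sketch of the decode-time split: attention runs where the
# growing KV cache lives (GPU pool), FFN / MoE expert execution runs on
# the low-latency tier. All names and math here are toy placeholders.
class GpuPool:
    def attention(self, layer, hidden):
        # Stand-in for decode attention over the KV cache.
        return hidden + layer["attn_delta"]

class LpxTier:
    def ffn_or_moe(self, layer, hidden):
        # Stand-in for FFN / MoE expert execution on static weights.
        return hidden * layer["ffn_gain"]

def decode_token(hidden, layers, gpu, lpx):
    """One decode step: each layer alternates between the two pools."""
    for layer in layers:
        hidden = gpu.attention(layer, hidden)
        hidden = lpx.ffn_or_moe(layer, hidden)
    return hidden

layers = [{"attn_delta": 1, "ffn_gain": 2}, {"attn_delta": 3, "ffn_gain": 1}]
out = decode_token(0, layers, GpuPool(), LpxTier())
```

The design consequence worth noticing is that hidden states cross between pools every layer, which is why the announcement leans so heavily on scale-up bandwidth and co-design rather than treating LPX as a standalone accelerator.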
Why this is high-signal
This announcement matters because AI infrastructure competition is moving beyond training benchmarks and raw accelerator counts toward interactive token generation economics. Agentic systems consume more tokens, run tighter tool loops, and place much more value on predictable latency. LPX is NVIDIA’s attempt to create a premium rack tier specifically for that regime.
Of course, many of the performance and revenue numbers are vendor claims and forward-looking by nature. But even with that caveat, the March 17 X post and the March 16 NVIDIA materials together show a clear strategic shift. NVIDIA is not only selling a bigger training platform. It is defining a stack in which training-scale throughput and ultra-low-latency inference become separate, co-designed layers of the same AI factory. That is a meaningful signal for developers building agentic coding, multi-agent, and real-time AI products.
Sources: NVIDIADC X post · NVIDIA newsroom announcement · NVIDIA technical blog