NVIDIA Vera shifts the AI-agent bottleneck from GPUs to CPUs
Original: NVIDIA Unveils Vera, the CPU for Agents
The cost story for AI agents is no longer just about GPUs and tokens. With Vera, NVIDIA is arguing that the next bottleneck is the CPU work around the model: running code, coordinating tools, processing data, and validating results while accelerators wait. The May 31, 2026, release says Vera is in full production and claims 1.8x faster task completion than x86 CPUs across agentic AI, reinforcement learning, and data-processing workloads.
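To make the bottleneck argument concrete, here is a rough Amdahl-style sketch of how much a faster CPU could shorten one agent step. It assumes, purely for illustration, that the 1.8x figure applies only to the CPU-bound share of the step; the release describes 1.8x as a task-completion figure and does not break it down this way. The function name and the example CPU fractions are hypothetical, not measurements.

```python
def end_to_end_speedup(cpu_fraction: float, cpu_speedup: float = 1.8) -> float:
    """Estimate whole-step speedup when only the CPU-bound share gets faster.

    cpu_fraction: share of wall-clock time spent on CPU work (tool calls,
                  sandboxed code, validation), between 0 and 1. Hypothetical.
    cpu_speedup:  assumed speedup of that CPU share. The 1.8 default reuses
                  NVIDIA's claimed figure as an illustrative input only.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_speedup <= 0:
        raise ValueError("cpu_speedup must be positive")
    # Amdahl's law: the untouched share stays the same, the CPU share shrinks.
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)


if __name__ == "__main__":
    # Hypothetical CPU shares of an agent step, not measured values.
    for fraction in (0.1, 0.3, 0.5, 0.8):
        print(f"CPU share {fraction:.0%}: {end_to_end_speedup(fraction):.2f}x overall")
```

Under these assumptions, the gain depends heavily on how much of the loop is CPU-bound, which is why the share of time accelerators spend waiting matters more than the headline multiplier.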
The hardware pitch is specific. Vera uses 88 custom Olympus cores, Spatial Multithreading, and an LPDDR5X memory subsystem rated at up to 1.2TB/s. In Vera Rubin systems, the CPU connects to GPUs through second-generation NVLink-C2C with up to 1.8TB/s of coherent bandwidth. NVIDIA positions it not as a generic host CPU but as the processor for Python runtimes, sandboxed code execution, orchestration logic, analytics pipelines, and other CPU-bound steps inside modern AI factories.
The ecosystem list is why this qualifies as more than a component update. NVIDIA names Anthropic, OpenAI, SpaceXAI, ByteDance, CoreWeave, Oracle Cloud Infrastructure, Lambda, Nebius, and Nscale among customers exploring or planning around Vera. Dell Technologies, HPE, Lenovo, Supermicro, and major Taiwan system builders are listed as system partners. The release also cites NYSE as an early infrastructure example, pointing to systems that process more than 1.1 trillion messages per day.
The useful test comes in fall 2026, when Vera systems are expected from system builders and cloud partners. Buyers will need real measurements of agent throughput, energy use, sandbox latency, and operational cost rather than keynote arithmetic. Still, the direction is clear: as agents get longer-running and more tool-heavy, AI infrastructure has to optimize the whole loop, not only model inference on the accelerator.
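As one example of what such a measurement might look like, the sketch below times repeated runs of a small Python snippet in a fresh interpreter process and reports the median and 95th-percentile latency. It is a minimal stand-in, not a real benchmark: a plain subprocess is not an actual sandbox, and the function names, run count, and snippet are all hypothetical choices rather than anything NVIDIA or its partners have published.

```python
import math
import statistics
import subprocess
import sys
import time


def time_sandbox_runs(snippet: str, runs: int = 5) -> list[float]:
    """Time `runs` executions of `snippet` in a fresh Python process each.

    A new interpreter process is used here as a crude stand-in for a
    sandboxed code-execution step; real sandboxes add their own overhead.
    """
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            [sys.executable, "-c", snippet],
            check=True,
            capture_output=True,
            timeout=30,
        )
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: list[float]) -> dict:
    """Return median and nearest-rank 95th-percentile latency in seconds."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # Nearest-rank percentile: smallest sample covering 95% of the runs.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "runs": len(ordered),
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
    }


if __name__ == "__main__":
    samples = time_sandbox_runs("print(sum(range(100_000)))", runs=10)
    print(summarize(samples))
```

Running the same harness on two different hosts would give a like-for-like comparison of this one step, though a serious evaluation would also need to capture energy use and cost, which this sketch does not attempt.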
Related Articles
Perplexity is replacing serial search calls with generated Python that composes retrieval primitives inside agent harnesses. In one CVE advisory case study, it says token use fell 85.1%, from 288.7K to 42.9K.
NVIDIA outlined a Rubin-based DGX SuperPOD architecture that combines compute, networking, and operations software as one deployment stack. The company claims up to 10x lower inference token cost versus the prior generation and targets availability in the second half of 2026.
NVIDIA said GTC 2026 will run March 16-19 in San Jose, California. The company projects 30,000+ attendees from 190+ countries and more than 1,000 sessions across the AI stack. The program includes Jensen Huang’s keynote, hands-on labs, startup showcases, and an analyst Q&A session.