Google splits its 8th-gen TPU line in two for agent training and inference
Google is no longer pretending one AI chip can do everything well enough for the agent era. On April 28, the company introduced its eighth-generation TPU line with a split architecture: TPU 8t for massive training jobs and TPU 8i for low-latency inference. That separation is the real story. AI agents do not just train once and serve static outputs; they plan, call tools, wait on each other and loop through reasoning steps. Google is designing hardware around that workload, not around a generic benchmark race. The source is Google’s post Our eighth generation TPUs: two chips for the agentic era.
The training side is large even by hyperscale standards. Google says a TPU 8t superpod scales to 9,600 chips, two petabytes of shared high-bandwidth memory, and 121 exaFLOPS of compute, with nearly 3x compute performance per pod over the previous generation. It also claims 10x faster storage access and near-linear scaling up to a million chips in a single logical cluster through Virgo Network, JAX, and Pathways. The message is simple: frontier training cycles that used to eat months are supposed to compress into weeks.
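Google's post shows no code, but the near-linear-scaling claim maps onto a programming model JAX already exposes publicly: one logical array sharded across a mesh of chips, with the compiler deciding placement and communication. The sketch below is a minimal illustration of that public pattern, not Pathways-internal or 8t-specific code; the mesh axis name and shapes are made up for the example.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional mesh over whatever accelerators are attached
# (TPU cores, GPUs, or a single CPU when run locally).
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

# Shard the batch dimension across the mesh; each chip holds one slice.
n = len(devices)
batch = jnp.arange(n * 4 * 128, dtype=jnp.float32).reshape(n * 4, 128)
batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))

@jax.jit
def toy_layer(x):
    # Under jit, XLA keeps x distributed and inserts whatever
    # cross-chip communication the matmul needs.
    return jnp.tanh(x @ x.T)

out = toy_layer(batch)
print(out.shape, out.sharding)  # the result stays laid out across the mesh
```

Operationally, this is what "a million chips in a single logical cluster" would mean for users: the visible program stays single-program, and scale comes from the shape of the mesh rather than from hand-written distribution code.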
TPU 8i targets the other bottleneck: inference systems that bog down when many agents have to cooperate in real time. Google says 8i pairs 288 GB of high-bandwidth memory with 384 MB of on-chip SRAM, doubles ICI bandwidth to 19.2 Tb/s, and cuts on-chip latency by up to 5x with a Collectives Acceleration Engine. The commercial promise is more concrete than at most chip launches: 80% better performance per dollar than the previous generation and nearly twice the served customer volume at the same cost.
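The collectives and ICI numbers matter because multi-chip inference spends much of its time in collective operations such as all-reduce, which cross the interconnect on every step. As a generic illustration of what such a collective is, here is a standard JAX `psum` sketch; the axis name and shapes are invented for the example, and nothing here is an 8i-specific API.

```python
import jax
import jax.numpy as jnp

n = jax.local_device_count()  # chips visible to this host

def average_logits(local_logits):
    # lax.psum is the all-reduce that rides the chip-to-chip interconnect
    # (ICI on TPUs); a collectives engine exists to make exactly this cheap.
    total = jax.lax.psum(local_logits, axis_name="chips")
    return total / n

# pmap launches one program instance per chip and wires up the collective.
per_chip = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)
out = jax.pmap(average_logits, axis_name="chips")(per_chip)
print(out)  # every chip ends up holding the same averaged vector
```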
The broader implication is that Google is trying to turn infrastructure specialization into a moat. Both chips run on Axion Arm-based CPU hosts, both are tied to Google’s AI Hypercomputer stack, and both are co-designed around Gemini and modern reasoning workloads. That gives Google a cleaner pitch to developers and enterprises who want more than model access; they want predictable economics when agents move from demos into always-on production. If these numbers hold when the systems are generally available later this year, the chip war shifts from raw compute bragging to a more practical question: who can run swarms of AI agents without letting latency, memory and power become the whole product?
Related Articles
HN did not read Google’s TorchTPU post as another cloud pitch. The real question in the thread was whether a PyTorch user can really switch to `tpu` without falling back into the old PyTorch/XLA pain cave.
Meta will add tens of millions of AWS Graviton cores, a sign that the AI infrastructure race is no longer just about GPUs. The company argues that agentic AI is inflating CPU-heavy work such as planning, orchestration, and data movement, making Graviton5 a strategic fit.
Google is signaling that enterprise AI is moving from demos to operational scale. In its April 22 Cloud Next update, the company said customer API traffic has risen to more than 16 billion tokens per minute and that just over half of its 2026 machine-learning compute investment will go to the Cloud business.