Google splits its 8th-gen TPU line in two for agent training and inference

Original: Our eighth generation TPUs: two chips for the agentic era

AI · Apr 28, 2026 · By Insights AI · 2 min read

Google is no longer pretending one AI chip can do everything well enough for the agent era. On April 28, the company introduced its eighth-generation TPU line with a split architecture: TPU 8t for massive training jobs and TPU 8i for low-latency inference. That separation is the real story. AI agents do not just train once and serve static outputs; they plan, call tools, wait on each other and loop through reasoning steps. Google is designing hardware around that workload, not around a generic benchmark race. The source is Google’s post Our eighth generation TPUs: two chips for the agentic era.
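That workload shape is worth making concrete. A minimal sketch of why per-step latency compounds in an agent loop; the function and all numbers below are illustrative assumptions, not figures from Google's post:

```python
# Illustrative only: an agent that plans, calls tools, and loops through
# reasoning steps pays inference latency once per step, so a modest per-call
# improvement compounds across the whole run.

def agent_wall_clock(steps: int, infer_ms: float, tool_ms: float) -> float:
    """Total wall-clock time (ms) for one agent run of `steps` reasoning
    steps, each doing one model call plus one tool call (hypothetical model)."""
    return steps * (infer_ms + tool_ms)

# Hypothetical numbers: 40 reasoning steps, 250 ms vs 50 ms per model call,
# 100 ms per tool call.
slow = agent_wall_clock(steps=40, infer_ms=250, tool_ms=100)  # 14000 ms
fast = agent_wall_clock(steps=40, infer_ms=50, tool_ms=100)   # 6000 ms
print(f"slow agent: {slow / 1000:.1f}s, fast agent: {fast / 1000:.1f}s")
```

Under these made-up numbers, a 5x cut in per-call inference latency more than halves the agent's total wall-clock time, even though tool calls are untouched.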

The training side is large even by hyperscale standards. Google says a TPU 8t superpod scales to 9,600 chips, two petabytes of shared high-bandwidth memory and 121 exaFLOPS of compute, with nearly 3x compute performance per pod over the previous generation. It also claims 10x faster storage access and near-linear scaling up to a million chips in a single logical cluster through Virgo Network, JAX and Pathways. The message is simple: frontier training cycles that used to eat months are supposed to compress into weeks.
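Taken at face value, those per-pod claims imply a rough wall-clock compression. A back-of-the-envelope sketch using a simple Amdahl-style two-component model; the split between compute-bound and storage-bound time is my assumption, not Google's:

```python
# Back-of-the-envelope: combine the claimed ~3x compute and 10x storage-access
# gains into one wall-clock speedup via an Amdahl-style two-component model.

def wall_clock_speedup(storage_frac: float, storage_gain: float,
                       compute_gain: float) -> float:
    """Speedup if `storage_frac` of the old run was storage-bound and the
    rest compute-bound. The 20% storage-bound split below is illustrative."""
    compute_frac = 1.0 - storage_frac
    return 1.0 / (storage_frac / storage_gain + compute_frac / compute_gain)

s = wall_clock_speedup(storage_frac=0.2, storage_gain=10.0, compute_gain=3.0)
print(f"overall speedup: {s:.2f}x")          # ~3.5x under these assumptions
print(f"12-week run -> {12 / s:.1f} weeks")  # months compressing toward weeks
```

The point of the sketch is only that a ~3x compute gain plus much faster storage plausibly turns a multi-month run into a multi-week one; the exact figure depends entirely on the assumed workload split.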

TPU 8i targets the other bottleneck: inference systems that bog down when many agents have to cooperate in real time. Google says 8i pairs 288 GB of high-bandwidth memory with 384 MB of on-chip SRAM, doubles inter-chip interconnect (ICI) bandwidth to 19.2 Tb/s and cuts on-chip latency by up to 5x with a Collectives Acceleration Engine. The commercial promise is more concrete than most chip launches: 80% better performance-per-dollar than the previous generation and nearly twice the served customer volume at the same cost.
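Those two commercial claims are roughly consistent with each other. A quick sanity check; the 80% figure is from the post, while the budget and volume numbers are hypothetical:

```python
# Sanity-check the inference economics: 80% better performance-per-dollar
# should translate to ~1.8x the served volume at the same spend, which lines
# up with the "nearly twice" framing in the announcement.

perf_per_dollar_gain = 1.80   # "80% better performance-per-dollar"
budget = 1_000_000            # hypothetical annual inference spend, dollars
baseline_volume = 100.0       # served volume on the prior generation (arbitrary units)

new_volume = baseline_volume * perf_per_dollar_gain
print(f"same ${budget:,} budget now serves "
      f"{new_volume / baseline_volume:.1f}x the volume")
```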

The broader implication is that Google is trying to turn infrastructure specialization into a moat. Both chips run on Axion Arm-based CPU hosts, both are tied to Google’s AI Hypercomputer stack, and both are co-designed around Gemini and modern reasoning workloads. That gives Google a cleaner pitch to developers and enterprises who want more than model access; they want predictable economics when agents move from demos into always-on production. If these numbers hold when the systems are generally available later this year, the chip war shifts from raw compute bragging to a more practical question: who can run swarms of AI agents without letting latency, memory and power become the whole product?



© 2026 Insights. All rights reserved.