Google splits its 8th-gen TPU line in two for agent training and inference

Original: Our eighth generation TPUs: two chips for the agentic era

AI · Apr 28, 2026 · By Insights AI · 2 min read

Google is no longer pretending one AI chip can do everything well enough for the agent era. On April 28, the company introduced its eighth-generation TPU line with a split architecture: TPU 8t for massive training jobs and TPU 8i for low-latency inference. That separation is the real story. AI agents do not just train once and serve static outputs; they plan, call tools, wait on each other and loop through reasoning steps. Google is designing hardware around that workload, not around a generic benchmark race. The source is Google’s post Our eighth generation TPUs: two chips for the agentic era.
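That workload shape is worth making concrete. A minimal sketch of why per-step latency compounds in an agent loop; the function and all numbers below are illustrative assumptions, not figures from Google's post:

```python
# Illustrative only: an agent that plans, calls tools, and loops through
# reasoning steps pays inference latency once per step, so a modest per-call
# improvement compounds across the whole run.

def agent_wall_clock(steps: int, infer_ms: float, tool_ms: float) -> float:
    """Total wall-clock time (ms) for one agent run of `steps` reasoning
    steps, each doing one model call plus one tool call (hypothetical model)."""
    return steps * (infer_ms + tool_ms)

# Hypothetical numbers: 40 reasoning steps, 250 ms vs 50 ms per model call,
# 100 ms per tool call.
slow = agent_wall_clock(steps=40, infer_ms=250, tool_ms=100)  # 14000 ms
fast = agent_wall_clock(steps=40, infer_ms=50, tool_ms=100)   # 6000 ms
print(f"slow agent: {slow / 1000:.1f}s, fast agent: {fast / 1000:.1f}s")
```

Under these made-up numbers, a 5x cut in per-call inference latency more than halves the agent's total wall-clock time, even though tool calls are untouched.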

The training side is large even by hyperscale standards. Google says a TPU 8t superpod scales to 9,600 chips, two petabytes of shared high-bandwidth memory and 121 exaFLOPS of compute, with nearly 3x compute performance per pod over the previous generation. It also claims 10x faster storage access and near-linear scaling up to a million chips in a single logical cluster through Virgo Network, JAX and Pathways. The message is simple: frontier training cycles that used to eat months are supposed to compress into weeks.
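Taken at face value, those per-pod claims imply a rough wall-clock compression. A back-of-the-envelope sketch using a simple Amdahl-style two-component model; the split between compute-bound and storage-bound time is my assumption, not Google's:

```python
# Back-of-the-envelope: combine the claimed ~3x compute and 10x storage-access
# gains into one wall-clock speedup via an Amdahl-style two-component model.

def wall_clock_speedup(storage_frac: float, storage_gain: float,
                       compute_gain: float) -> float:
    """Speedup if `storage_frac` of the old run was storage-bound and the
    rest compute-bound. The 20% storage-bound split below is illustrative."""
    compute_frac = 1.0 - storage_frac
    return 1.0 / (storage_frac / storage_gain + compute_frac / compute_gain)

s = wall_clock_speedup(storage_frac=0.2, storage_gain=10.0, compute_gain=3.0)
print(f"overall speedup: {s:.2f}x")          # ~3.5x under these assumptions
print(f"12-week run -> {12 / s:.1f} weeks")  # months compressing toward weeks
```

The point of the sketch is only that a ~3x compute gain plus much faster storage plausibly turns a multi-month run into a multi-week one; the exact figure depends entirely on the assumed workload split.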

TPU 8i targets the other bottleneck: inference systems that bog down when many agents have to cooperate in real time. Google says 8i pairs 288 GB of high-bandwidth memory with 384 MB of on-chip SRAM, doubles inter-chip interconnect (ICI) bandwidth to 19.2 Tb/s and cuts on-chip latency by up to 5x with a Collectives Acceleration Engine. The commercial promise is more concrete than most chip launches: 80% better performance-per-dollar than the previous generation and nearly twice the served customer volume at the same cost.
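Those two commercial claims are roughly consistent with each other. A quick sanity check; the 80% figure is from the post, while the budget and volume numbers are hypothetical:

```python
# Sanity-check the inference economics: 80% better performance-per-dollar
# should translate to ~1.8x the served volume at the same spend, which lines
# up with the "nearly twice" framing in the announcement.

perf_per_dollar_gain = 1.80   # "80% better performance-per-dollar"
budget = 1_000_000            # hypothetical annual inference spend, dollars
baseline_volume = 100.0       # served volume on the prior generation (arbitrary units)

new_volume = baseline_volume * perf_per_dollar_gain
print(f"same ${budget:,} budget now serves "
      f"{new_volume / baseline_volume:.1f}x the volume")
```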

The broader implication is that Google is trying to turn infrastructure specialization into a moat. Both chips run on Axion Arm-based CPU hosts, both are tied to Google’s AI Hypercomputer stack, and both are co-designed around Gemini and modern reasoning workloads. That gives Google a cleaner pitch to developers and enterprises who want more than model access; they want predictable economics when agents move from demos into always-on production. If these numbers hold when the systems are generally available later this year, the chip war shifts from raw compute bragging to a more practical question: who can run swarms of AI agents without letting latency, memory and power become the whole product?



© 2026 Insights. All rights reserved.