DeepMind trains a 12B model across four regions 20x faster
Google DeepMind used an April 23 thread on X to introduce Decoupled DiLoCo, describing it as a resilient and flexible way to train advanced AI models across multiple data centres. That framing matters because it targets a growing frontier bottleneck: not model quality, but the brittleness of keeping giant clusters synchronized across hardware failures and data-centre boundaries.
The linked blog post puts hard numbers behind the claim. Google DeepMind says Decoupled DiLoCo trained a 12-billion-parameter Gemma model across four separate U.S. regions over 2-5 Gbps wide-area links, and did so more than 20 times faster than conventional synchronization methods. On benchmarked ML performance, it reached 64.1% average accuracy, nearly matching the 64.4% baseline while using dramatically less bandwidth.
Another important detail is failure tolerance. In simulated large-scale outages, DeepMind says the system sustained 88% goodput versus 27% for standard data-parallel training. The setup can also mix TPU v6e and TPU v5p in a single training run without losing ML performance, which matters for any lab trying to use partially upgraded fleets instead of waiting for perfectly matched clusters. The same figures say required bandwidth can drop from 198 Gbps to 0.84 Gbps across eight data centres, a reduction of roughly 235x. That is not a small optimization; it is a different assumption about what counts as usable training infrastructure.
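Where does a bandwidth drop that large come from? The DiLoCo family of methods, which Decoupled DiLoCo builds on, lets each data centre run many local optimizer steps and only exchanges parameter deltas at a coarse outer cadence, so cross-datacentre traffic shrinks roughly with the number of inner steps between syncs. Below is a minimal sketch of that inner/outer structure on a toy quadratic objective. It is illustrative only, not DeepMind's implementation: the function names, the H=100 inner-step count, and the plain-SGD outer update are assumptions for the sake of the example (published DiLoCo uses Nesterov momentum in the outer step).

```python
import numpy as np

def local_steps(params, grad_fn, lr, H):
    """Run H inner SGD steps entirely inside one data centre (no WAN traffic)."""
    p = params.copy()
    for _ in range(H):
        p -= lr * grad_fn(p)
    return p

def outer_sync(global_params, replica_params, outer_lr):
    """Average per-replica parameter deltas and apply one outer update.

    This is the only step that crosses the wide-area network, so WAN
    bandwidth scales with outer rounds, not with inner optimizer steps.
    """
    deltas = [global_params - p for p in replica_params]
    avg_delta = sum(deltas) / len(deltas)
    return global_params - outer_lr * avg_delta

# Toy run: 4 "data centres" jointly minimizing f(p) = ||p||^2,
# syncing once per 100 inner steps -> 5 WAN exchanges instead of 500.
rng = np.random.default_rng(0)
params = rng.normal(size=8)
grad_fn = lambda p: 2.0 * p  # gradient of ||p||^2

for _ in range(5):  # 5 outer rounds
    replicas = [local_steps(params, grad_fn, lr=0.01, H=100) for _ in range(4)]
    params = outer_sync(params, replicas, outer_lr=0.9)

print(np.linalg.norm(params))  # norm shrinks sharply toward 0
```

The point of the sketch is the communication accounting: with 100 inner steps per sync, this loop exchanges parameters 100x less often than lockstep data-parallel training would, which is the qualitative mechanism behind the 198 Gbps to 0.84 Gbps figure (the exact ratio in the real system also depends on compression and scheduling details the post does not fully describe).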
The GoogleDeepMind account usually uses X to point to research papers, model work, and infrastructure milestones, and this post is clearly in the infrastructure bucket. The next thing to watch is whether Decoupled DiLoCo stays a Gemma-era research result or becomes part of larger production training runs. If it scales beyond the demo numbers, it could reshape how frontier labs think about stranded compute, chip heterogeneity, and failure tolerance.