DeepMind's Decoupled DiLoCo keeps frontier training alive through failures
Original: Decoupled DiLoCo: A new frontier for resilient, distributed AI training
Training frontier models across multiple regions has a basic weakness: one slow or broken cluster can drag down the whole run. In a new Google DeepMind post, the company says Decoupled DiLoCo attacks that problem by breaking a training job into separate learner units that exchange updates asynchronously instead of forcing every chip into lockstep.
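The core DiLoCo idea is a two-level loop: each learner runs many fast local SGD steps on its own, and only a slow outer step exchanges parameter deltas ("pseudo-gradients") across sites. Here is a minimal single-process sketch of that structure on a toy one-parameter model; the function names, learning rates, and toy objective are illustrative assumptions, not DeepMind's implementation:

```python
import random

def inner_steps(params, data, lr=0.1, steps=20):
    """Local SGD on one learner; no cross-site traffic happens here."""
    p = params[:]
    for _ in range(steps):
        x, y = random.choice(data)
        # gradient of (p0*x - y)^2 w.r.t. p0 (toy 1-parameter linear model)
        grad = 2 * (p[0] * x - y) * x
        p[0] -= lr * grad
    return p

def outer_step(global_params, learner_params, outer_lr=0.7):
    """Aggregate pseudo-gradients (parameter deltas) from the learners
    that reported this round -- the only cross-site exchange."""
    deltas = [[g - l for g, l in zip(global_params, lp)]
              for lp in learner_params]
    avg = [sum(d) / len(deltas) for d in zip(*deltas)]
    return [g - outer_lr * a for g, a in zip(global_params, avg)]

random.seed(0)
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0]]  # target slope: 3.0
params = [0.0]
for _ in range(5):
    # each learner trains independently between outer steps
    reports = [inner_steps(params, data) for _ in range(4)]
    params = outer_step(params, reports)
print(round(params[0], 1))  # converges toward 3.0
```

Because cross-site communication happens only in `outer_step`, each learner's inner loop can run at its own pace, which is what lets the units decouple.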
That architecture matters because synchronization has been the tax that made globally distributed training hard to justify. DeepMind says its eight-datacenter setup cuts required cross-site bandwidth from 198 Gbps to 0.84 Gbps. In simulated runs with 1.2 million chips and high failure rates, it reports 88% goodput for Decoupled DiLoCo versus 27% for standard data-parallel training, while benchmark accuracy stays effectively flat at 64.1% versus 64.4%.
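The bandwidth saving follows from simple arithmetic: data-parallel training ships a full gradient exchange every optimizer step, while a DiLoCo-style run ships one model-sized delta per H inner steps, cutting wide-area traffic by roughly a factor of H. A back-of-envelope calculation with illustrative numbers (the model size, step rate, and H below are assumptions, not DeepMind's reported figures):

```python
def cross_site_gbps(param_bytes, steps_per_sec, sync_every):
    """Gbps crossing the WAN if one full-model-sized payload is
    exchanged every `sync_every` optimizer steps."""
    return param_bytes * 8 * steps_per_sec / sync_every / 1e9

MODEL_BYTES = 12e9 * 2   # 12B params in bf16 (illustrative)
STEPS_PER_SEC = 1.0      # illustrative optimizer step rate

every_step = cross_site_gbps(MODEL_BYTES, STEPS_PER_SEC, 1)    # sync each step
diloco = cross_site_gbps(MODEL_BYTES, STEPS_PER_SEC, 200)      # H = 200
print(round(every_step, 1), round(diloco, 2))
```

The ratio between the two numbers is just H, which is why stretching the sync interval turns a hundreds-of-Gbps requirement into something ordinary inter-datacenter links can carry.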
The operational claim is just as important as the benchmark story. DeepMind says it used chaos engineering to inject hardware failures during training, including the loss of entire learner units. The system kept training, then brought those units back into the run when they recovered. That is a different promise from traditional tightly coupled jobs, where a failure can stall or waste a huge amount of expensive compute.
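The recovery behavior described above can be illustrated with a toy loop, under two assumptions that are mine rather than DeepMind's: the outer step simply averages over whichever learners report, and a recovered learner rejoins by re-syncing from the latest global parameters:

```python
import random

def train_round(global_p, learners_up):
    """One outer round: only live learners run inner steps and report."""
    reports = []
    for up in learners_up:
        if not up:
            continue  # a failed learner contributes nothing this round
        local = global_p - 0.5 * (global_p - 3.0)  # stand-in for inner SGD
        reports.append(global_p - local)           # pseudo-gradient
    avg = sum(reports) / len(reports)              # average over survivors
    return global_p - 0.7 * avg

random.seed(1)
g = 0.0
up = [True] * 4
for round_ in range(10):
    up[random.randrange(4)] = False   # chaos: kill a random learner
    g = train_round(g, up)
    if round_ % 3 == 2:
        up = [True] * 4               # recovered units rejoin from g
print(round(g, 2))                    # still approaches 3.0
```

The point of the sketch is that nothing blocks on the dead learners: the outer average is taken over whoever reported, so training keeps moving and rejoining units just start from the current global state.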
The team also says it trained a 12 billion parameter model across four U.S. regions using 2-5 Gbps wide-area links, and did so more than 20 times faster than conventional synchronization methods. Because the approach can mix hardware generations, including TPU v6e and TPU v5p, it also points to a practical way to use older chips instead of waiting for a perfectly matched cluster.
The broader takeaway is not just that DeepMind found a better networking trick. If Decoupled DiLoCo works at production scale, it changes what counts as usable training infrastructure: stranded compute in other regions, mixed hardware fleets, and ordinary inter-datacenter connectivity start to look less like constraints and more like deployable capacity. For labs chasing larger models, that could be one of the most important infrastructure shifts of the year.