DeepMind is aiming at a stubborn systems problem: one slow or broken learner can still stall an entire pretraining run. The paper claims competitive model quality with strictly zero global downtime in failure-prone simulations spanning millions of chips.
#llm-training
Training a frontier model across far-flung data centers usually means paying a brutal synchronization tax. DeepMind says Decoupled DiLoCo cuts cross-site bandwidth from 198 Gbps to 0.84 Gbps in its eight-datacenter setup while holding benchmark accuracy near the baseline at 64.1%.
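The blurb doesn't spell out where the bandwidth savings come from, but DiLoCo-style methods generally run many local optimizer steps per site and only synchronize a parameter delta every H steps. A minimal sketch of that pattern, with a toy quadratic objective standing in for real training (all function names and constants here are illustrative, not from the paper):

```python
import numpy as np

# Sketch of a DiLoCo-style low-communication loop: each site runs H local
# steps, then only the parameter delta crosses the wide-area link, so
# cross-site traffic shrinks by roughly a factor of H.

def local_steps(params, grad_fn, lr=0.1, H=100):
    """Run H inner optimizer steps entirely inside one datacenter."""
    p = params.copy()
    for _ in range(H):
        p -= lr * grad_fn(p)
    return p

def outer_sync(global_params, site_params, outer_lr=0.7):
    """Average per-site deltas and apply them as an outer 'pseudo-gradient'."""
    deltas = [sp - global_params for sp in site_params]
    return global_params + outer_lr * np.mean(deltas, axis=0)

# Toy objective: each site minimizes ||p - target||^2 with its own target.
targets = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
params = np.zeros(2)
for _ in range(5):  # 5 outer rounds = 5 cross-site messages, not 500
    sites = [local_steps(params, lambda p, t=t: 2 * (p - t)) for t in targets]
    params = outer_sync(params, sites)
print(params)  # converges toward the mean target [2.0, 1.0]
```

The key design choice in this family of methods is treating the averaged delta as a gradient for a separate outer optimizer, so the inner loops can diverge freely between syncs.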
MegaTrain proposes training 100B+ parameter LLMs at full precision on a single GPU by keeping parameters and optimizer states in host memory and streaming layers through the device. The recent Hacker News interest is notable because the paper reframes the problem as one of memory-system design rather than simple GPU count.
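The core mechanism described, keeping weights in host memory and streaming a small working set of layers through the GPU, can be sketched with dummy objects in place of real tensors. The class and function names below are mine, not MegaTrain's API, and the `window` eviction policy is an assumption:

```python
from collections import deque

# Hypothetical sketch of layer streaming: all weights live in host RAM and
# at most `window` layers occupy the device at any moment.

class HostLayer:
    """Layer whose weights reside in host memory between uses."""
    def __init__(self, idx):
        self.weight = float(idx)  # stand-in for a real parameter tensor

    def to_device(self):
        return DeviceLayer(self.weight)  # stand-in for an H2D copy

class DeviceLayer:
    """Transient on-device copy of a layer."""
    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return x + self.weight  # stand-in for the real layer compute

def streamed_forward(layers, x, window=2):
    """Forward pass keeping at most `window` layers resident on device.

    A real implementation would prefetch layer i+1 on a copy stream while
    layer i computes, overlapping PCIe transfers with GPU work.
    """
    resident = deque()
    for layer in layers:
        resident.append(layer.to_device())  # bring next layer onto device
        if len(resident) > window:
            resident.popleft()  # evict oldest; master weights stay on host
        x = resident[-1].forward(x)
    return x

model = [HostLayer(i) for i in range(100)]  # 100 layers, none pinned to GPU
out = streamed_forward(model, 0.0)
print(out)  # sum of 0..99 = 4950.0
```

Peak device memory in this scheme scales with the window size rather than the model size, which is what makes the single-GPU framing plausible as a memory-system problem.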
Anthropic said on February 23, 2026 that DeepSeek, Moonshot AI, and MiniMax carried out industrial-scale distillation attacks against Claude. The company framed model-output extraction as a security and platform integrity problem, not just a competitive concern.
A LocalLLaMA post pointed to a new Hugging Face dataset of human-written code reviews, pairing before-and-after code changes with inline reviewer comments and negative examples across 37 languages.
Andrej Karpathy has published autoresearch, a minimal repo that lets AI agents iterate on a stripped-down nanochat training loop overnight. The project turns agent evaluation into a closed-loop research workflow with fixed 5-minute runs, Git branches, and validation-loss-based selection.
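The closed loop described (fixed-budget runs, one branch per candidate, selection by validation loss) reduces to a simple tournament. A minimal sketch, with a toy loss surface standing in for the real 5-minute nanochat runs and all names invented for illustration:

```python
import random

# Hypothetical sketch of an autoresearch-style selection round: each agent
# proposal gets one fixed-budget training run, and the branch with the
# lowest validation loss is kept.

def train_and_eval(hparams, seed=0):
    """Stand-in for a fixed-budget run returning validation loss."""
    rng = random.Random(seed)
    lr = hparams["lr"]
    # Toy loss surface: best near lr = 0.01, plus run-to-run noise.
    return (lr - 0.01) ** 2 * 1e4 + rng.uniform(0.0, 0.05)

def research_round(candidates):
    """One overnight round: evaluate every branch, keep the best."""
    results = [(train_and_eval(h, seed=i), h) for i, h in enumerate(candidates)]
    results.sort(key=lambda r: r[0])
    return results[0]

branches = [{"lr": lr} for lr in (0.003, 0.01, 0.03, 0.1)]
best_loss, best_hparams = research_round(branches)
print(best_hparams)  # the lr = 0.01 branch wins on this toy surface
```

Fixing the run budget and seed handling matters here: without a constant compute envelope, the selection step would reward longer runs rather than better ideas.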
A March 4, 2026 Hacker News thread surfaced Q Labs’ Slowrun benchmark, which fixes training data at 100M FineWeb tokens and rewards data efficiency under large compute budgets.
A high-engagement r/singularity post pointed to arXiv 2602.15322, which reports that masked adaptive updates and the proposed Magma optimizer can improve 1B-model perplexity versus Adam and Muon with minimal overhead.
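One plausible reading of "masked adaptive updates" is an Adam-style step applied only to a subset of coordinates, with optimizer state frozen elsewhere. The sketch below shows that pattern; it is my interpretation, not the paper's actual Magma algorithm, and it omits Adam's bias correction for brevity:

```python
import numpy as np

# Hedged sketch of a masked adaptive update: second-moment scaling as in
# Adam, but entries where mask == False keep both parameters and optimizer
# state frozen for that step. Bias correction is omitted for brevity.

def masked_adam_step(p, g, m, v, mask, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One optimizer step updating only the masked coordinates."""
    m = np.where(mask, b1 * m + (1 - b1) * g, m)        # first moment
    v = np.where(mask, b2 * v + (1 - b2) * g * g, v)    # second moment
    p = np.where(mask, p - lr * m / (np.sqrt(v) + eps), p)
    return p, m, v

p = np.ones(4)
m = np.zeros(4)
v = np.zeros(4)
g = np.full(4, 0.5)
mask = np.array([1, 0, 1, 0], dtype=bool)
p, m, v = masked_adam_step(p, g, m, v, mask)
print(p)  # only coordinates 0 and 2 move; 1 and 3 stay at 1.0
```

The claimed appeal of this family of updates is that the mask makes each step cheaper and sparser while the adaptive scaling is preserved on the coordinates that do move.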