Post by Google DeepMind

Decoupled DiLoCo is our latest approach to training AI models across multiple distant data centers. Training normally relies on identical chips staying in near-perfect synchronization: if a single chip fails, the entire training run can stall. With Decoupled DiLoCo, we explored a way to train across a global network.

Here are some of our results:

🔘 We trained a 12B-parameter model simultaneously across four US regions, so we are no longer constrained by the size of a single data center.
🔘 The system seamlessly mixes older and newer chip generations without slowing down, unlocking more value from existing hardware.
🔘 If hardware breaks mid-run, the system isolates the failure and keeps training.

We look forward to continuing to evolve our systems into more resilient, useful tools that help us develop the next generation of AI.

Find out more → https://goo.gle/4mNE36q