
Training frontier models isn’t as simple as adding more GPUs: one small problem and the whole coordinated dance falls apart. OpenAI’s Mark Handley and Greg Steinbrecher discuss how a new supercomputer network design, used to train some of the company’s latest models, keeps the whole system moving in lockstep, even with record numbers of GPUs. They break down Multipath Reliable Connection (MRC), a new protocol OpenAI developed with AMD, Broadcom, Intel, Microsoft, and Nvidia, and why they’re making it available for the whole industry to use.

Chapters
00:00 Intro
00:39 Greg and Mark's paths to OpenAI
04:34 Why training AI stresses networks differently
10:05 Bottlenecks, failures, and the cost of waiting
15:19 How Multipath Reliable Connection works
18:59 A protocol to route around failures
25:05 Why OpenAI is making MRC an open standard
35:09 Could AI compute move to space?

Hosted on Acast. See acast.com/privacy for more information.

How a reasoning model cracked an 80-year-old math problem - Episode 20

Episode 19 - Inside image generation’s Renaissance moment

Episode 17 - What happens now that AI is good at math?

Episode 16 - Building AI for Life Sciences