Post by Broadcom

Networking for 100,000+ GPUs is no longer a bottleneck.

Broadcom, in collaboration with AMD, Intel, Microsoft, NVIDIA, and OpenAI, has developed Multipath Reliable Connection (MRC) to ensure predictable performance for frontier model training. By making the network behave like a single non-blocking switch, MRC eliminates the "failure amplifier" effect in synchronous AI training.

This enhancement to RoCEv2 addresses scale through three critical shifts:

➡️ Multi-plane Topologies: We split 800Gb/s interfaces into multiple 100Gb/s planes, increasing path diversity to connect 128K XPUs with only two tiers of Ethernet switches.

➡️ Adaptive Packet Spraying: MRC sprays packets across hundreds of paths simultaneously, utilizing every available link and eliminating congestion hotspots.

➡️ SRv6 Source Routing: By embedding paths directly in destination addresses, MRC bypasses failures in microseconds without the seconds-long convergence delays of dynamic routing.

This breakthrough is a testament to what happens when the industry’s leaders come together to solve the hardest problems in AI infrastructure.

MRC is supported on Broadcom’s Thor Ultra NICs and Tomahawk 5 and 6 switches.

Read the full technical breakdown here: https://brcm.tech/4cRiuP0
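As a rough sanity check on the multi-plane numbers above, the sketch below applies the standard non-blocking two-tier (leaf-spine) sizing rule: each leaf uses half its ports for endpoints and half for uplinks, so a switch with radix k reaches k²/2 endpoints. The 51,200 Gb/s switch capacity is an assumption for illustration and does not come from the post, and the function and constant names are hypothetical. This is not a description of how MRC deployments are actually designed.

```python
def two_tier_endpoints(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier leaf-spine fabric.

    Each leaf splits its ports evenly: radix/2 down to endpoints and
    radix/2 up to spines. Each spine port connects one leaf, so there
    can be at most `radix` leaves.
    """
    down_ports_per_leaf = radix // 2
    max_leaves = radix
    return down_ports_per_leaf * max_leaves


NIC_GBPS = 800     # interface speed named in the post
PLANE_GBPS = 100   # per-plane link speed named in the post
ASSUMED_SWITCH_GBPS = 51_200  # ASSUMPTION: switch capacity, not stated in the post

planes = NIC_GBPS // PLANE_GBPS                      # planes per interface
radix_at_100g = ASSUMED_SWITCH_GBPS // PLANE_GBPS    # ports if run at 100Gb/s
radix_at_800g = ASSUMED_SWITCH_GBPS // NIC_GBPS      # ports if run at 800Gb/s

# Endpoints reachable in two tiers, with and without splitting into planes.
endpoints_split = two_tier_endpoints(radix_at_100g)
endpoints_unsplit = two_tier_endpoints(radix_at_800g)

# Between two endpoints on different leaves, each plane offers one path
# per spine (radix/2 spines), and every plane contributes its own set.
paths_per_plane = radix_at_100g // 2
total_paths = planes * paths_per_plane

print(f"planes per interface:        {planes}")
print(f"two-tier endpoints @100Gb/s: {endpoints_split}")
print(f"two-tier endpoints @800Gb/s: {endpoints_unsplit}")
print(f"paths between two endpoints: {total_paths}")
```

Under these assumptions, running the same switch at 100Gb/s per port instead of 800Gb/s raises the two-tier endpoint count from 2,048 to 131,072 (128K), which is consistent with the figure quoted in the post.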
