InfiniBand for AI Clusters: Why Ethernet Isn't Enough (Yet)
Many of the largest GPU training clusters run InfiniBand. The $5 trillion question is whether Ultra Ethernet will change that. The honest answer: not yet, and here's exactly where the gap is.
I don't train models. I build optical fabric for the people who do, which gives me a front-row seat to which interconnect the workload actually demands. The same question keeps coming up: why is the AI workload allergic to Ethernet, and is that about to change? Short answer first, then the data.
Many of the best-known GPU training runs, GPT-4 among them, ran on InfiniBand. Not Ethernet. Not even 400G Ethernet. The pattern is not universal: Google trains Gemini on TPU pods with its own interconnect, and Meta has described training Llama 3 on both RoCE and InfiniBand clusters. Understanding why InfiniBand remains the default choice requires understanding what large-scale distributed training actually demands from a network, and why the requirements differ from everything else in a datacenter.
Training a large transformer model involves thousands of GPUs exchanging gradients continuously. The traffic is dominated by collective operations: AllReduce, AllGather, ReduceScatter. Each GPU's result depends on data from every other GPU, the exchange is synchronous, and per-message latency has to stay at the microsecond level. A single slow link stalls the entire training step.
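To make the dependency chain concrete, here is a minimal sketch of a ring AllReduce, simulated in one process with plain Python lists. It assumes the textbook ring algorithm (a ReduceScatter phase followed by an AllGather phase); real libraries such as NCCL choose among several algorithms, and the function name `ring_allreduce` is mine, not a library API.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Sum-AllReduce over a simulated ring; grads[r] is rank r's vector.

    Assumes every vector has the same length and that the length is
    divisible by the number of ranks.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must be divisible by rank count"
    chunk = size // n
    data = [list(g) for g in grads]  # copy so the inputs stay untouched

    def sl(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, ReduceScatter: n - 1 steps. In step s, rank r sends chunk
    # (r - s) % n to its right-hand neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][sl((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][sl(c)] = [a + b for a, b in zip(data[dst][sl(c)], payload)]
    # Rank r now holds the fully reduced chunk (r + 1) % n.

    # Phase 2, AllGather: n - 1 steps. In step s, rank r forwards chunk
    # (r + 1 - s) % n, which is already complete, to its neighbour.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][sl((r + 1 - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][sl(c)] = payload
    return data


if __name__ == "__main__":
    ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    for vec in ring_allreduce(ranks):
        print(vec)
```

Each of the 2(n - 1) steps can only start once the previous step's data has arrived from a neighbour, which is why one slow link holds back every rank in the ring.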
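The key metrics for AI fabric networks follow from that pattern: latency per hop, usable bandwidth per port, and behavior under congestion, meaning packet loss and tail latency.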
InfiniBand was designed for HPC collective operations. RDMA (Remote Direct Memory Access) is native to the protocol: the data path bypasses the kernel and avoids intermediate copies. Latency from GPU memory to GPU memory across a single hop is 600–800 nanoseconds on NDR 400G InfiniBand. The adaptive routing built into NVIDIA's Quantum-2 and Quantum-X800 switches rebalances traffic across paths within microseconds, which keeps congestion from building up during collective operations.
NDR InfiniBand (400 Gb/s per port, 2023) and XDR (800 Gb/s, 2024 sampling) give NVIDIA a headroom lead over current Ethernet standards. A 400G IB port in a training cluster carries more useful traffic per second than a 400G Ethernet port because IB's credit-based, lossless link layer avoids the packet drops, and therefore the retransmissions, that lossy Ethernet incurs under congestion.
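A rough way to see how latency and port speed combine is the standard ring AllReduce cost model: 2(N - 1) steps, each paying a fixed latency plus the time to move 1/N of the message. The sketch below is illustrative only. It treats the 600–800 ns single-hop figure as the whole per-step latency (an optimistic assumption, since real steps also pay software and multi-hop costs), and `ring_allreduce_seconds` is a hypothetical helper, not a library function.

```python
from typing import Tuple


def ring_allreduce_seconds(
    n_gpus: int,
    message_bytes: float,
    link_gbps: float,
    step_latency_s: float,
) -> Tuple[float, float]:
    """Return (latency_term, bandwidth_term) in seconds for a ring AllReduce.

    Model: 2 * (N - 1) steps; each step costs step_latency_s plus the time
    to send message_bytes / N over one link.
    """
    bytes_per_s = link_gbps * 1e9 / 8  # Gb/s -> bytes/s
    steps = 2 * (n_gpus - 1)
    latency_term = steps * step_latency_s
    bandwidth_term = steps * (message_bytes / n_gpus) / bytes_per_s
    return latency_term, bandwidth_term


if __name__ == "__main__":
    # Assumed inputs: 4,096 GPUs, 400 Gb/s links, 700 ns per step.
    for size in (1e6, 1e9):  # 1 MB and 1 GB messages
        lat, bw = ring_allreduce_seconds(4096, size, 400, 700e-9)
        print(f"{size:.0e} B: latency {lat * 1e3:.2f} ms, bandwidth {bw * 1e3:.2f} ms")
```

Under these assumptions the latency term is about 5.7 ms regardless of message size, while the bandwidth term is about 40 ms for a 1 GB message and about 0.04 ms for a 1 MB message. Small collectives are latency-bound and large ones are bandwidth-bound, so both numbers matter.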
Standard Ethernet is not lossless. Congestion causes packet drops. When a packet is dropped during an AllReduce, every GPU in the collective waits on the slowest retransmission before it can proceed. In a 4,096-GPU cluster, one stalled GPU stops 4,095 others. The exposure compounds with scale: more GPUs means more chances that some link drops a packet in a given step, and each stall idles more GPUs, so the GPU-time lost grows roughly quadratically with cluster size.
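One way to read the scaling claim is as a back-of-envelope model. Assume each GPU independently hits a drop-induced stall in a given step with probability p (a hypothetical number, not a measurement), and that any single stall idles everyone else for the same duration. The function names below are mine.

```python
def stall_probability(n_gpus: int, p_stall_per_gpu: float) -> float:
    """Probability that at least one of n_gpus stalls in a given step."""
    return 1.0 - (1.0 - p_stall_per_gpu) ** n_gpus


def expected_idle_gpu_seconds(n_gpus: int, p_stall_per_gpu: float, stall_s: float) -> float:
    """Expected GPU-seconds wasted per step: P(any stall) * idle GPUs * duration."""
    return stall_probability(n_gpus, p_stall_per_gpu) * (n_gpus - 1) * stall_s


if __name__ == "__main__":
    p = 1e-6        # assumed per-GPU stall probability per step
    stall = 1e-3    # assumed stall duration in seconds
    for n in (512, 4096):
        print(n, stall_probability(n, p), expected_idle_gpu_seconds(n, p, stall))
```

For small p, the chance of a stall grows roughly linearly with GPU count, and the number of GPUs left idle also grows linearly. The product is where the quadratic shape comes from: in this model, going from 512 to 4,096 GPUs (8x) raises the expected idle GPU-time per step by roughly 64x.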
RoCE (RDMA over Converged Ethernet) adds RDMA capability to Ethernet but requires a lossless Ethernet fabric using Priority Flow Control (PFC) or ECN-based congestion management. In large clusters, PFC pause frames can propagate hop by hop and spread congestion across the fabric. When the pause dependencies form a loop, traffic on that loop stops entirely. This is a known failure mode called "PFC deadlock", and avoiding it takes careful topology design and traffic class isolation.
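The deadlock condition can be checked on paper before it shows up in production. The sketch below assumes you can describe the fabric as a "waits on" graph, where an edge from X to Y means buffer X cannot drain until buffer Y does, and looks for a cycle with a depth-first search. The graph format, the node names, and `find_pause_cycle` are all hypothetical; this is not a vendor tool.

```python
from typing import Dict, List, Optional


def find_pause_cycle(waits_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle in the 'waits on' graph as a list of nodes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in waits_on.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(waits_on):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    # Up-down routing in a tree: dependencies only point upward, no loop.
    print(find_pause_cycle({"leaf1": ["spine1"], "leaf2": ["spine1"], "spine1": []}))
    # A rerouted path that creates a loop of buffer dependencies.
    print(find_pause_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))
```

An acyclic graph means pauses can always drain from the far end; a cycle means every buffer in the loop can end up waiting on another one in the same loop, which is the deadlock.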
The Ultra Ethernet Consortium (members include AMD, Arista, Broadcom, Cisco, Intel, Meta, and Microsoft) is defining a new Ethernet transport layer specifically for AI workloads. Key changes: packet spraying across multiple paths (so that no single path becomes the congestion point), end-to-end congestion notification without PFC dependency, and hardware-accelerated collective operations at the NIC.
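To illustrate why packet spraying matters for this traffic, the sketch below compares two ways of placing a handful of large flows onto equal-cost paths: hashing each flow onto one path, and spraying packets round-robin across all paths. It is a toy model under stated assumptions: `zlib.crc32` stands in for a switch's ECMP hash, round-robin stands in for whatever spraying policy real hardware uses, and reordering at the receiver is ignored.

```python
import zlib
from typing import List


def per_flow_load(flow_sizes: List[int], n_paths: int) -> List[int]:
    """Packets per path when every flow is pinned to one hashed path."""
    load = [0] * n_paths
    for flow_id, size in enumerate(flow_sizes):
        # crc32 is only a stand-in for a real ECMP hash over packet headers.
        path = zlib.crc32(flow_id.to_bytes(4, "big")) % n_paths
        load[path] += size
    return load


def sprayed_load(flow_sizes: List[int], n_paths: int) -> List[int]:
    """Packets per path when packets are sprayed round-robin over all paths."""
    load = [0] * n_paths
    nxt = 0
    for size in flow_sizes:
        for _ in range(size):
            load[nxt] += 1
            nxt = (nxt + 1) % n_paths
    return load


if __name__ == "__main__":
    flows = [1000] * 8  # eight equal "elephant" flows, sizes in packets
    hashed = per_flow_load(flows, 8)
    sprayed = sprayed_load(flows, 8)
    print("hashed :", hashed, "max", max(hashed))
    print("sprayed:", sprayed, "max", max(sprayed))
```

Training traffic tends to be a small number of large flows, so per-flow hashing has few chances to average out and two flows landing on the same path doubles that path's load. Spraying keeps the busiest path within one packet of the average, at the cost of out-of-order arrival that the transport then has to handle.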
Ultra Ethernet spec v1.0 was published in mid-2025. Silicon in sampling in 2025. Volume production: 2026–2027. The open question is whether it closes the latency gap with native InfiniBand on large-scale collective operations, and the production data to answer that doesn't exist yet.
New large-scale AI training clusters (1,000+ GPUs): InfiniBand, specifically NVIDIA Quantum-2 (NDR) or Quantum-X800 (XDR). Proven interoperability with NVIDIA's NVLink fabric and the NCCL collective communication library removes a significant integration risk, and Ultra Ethernet doesn't yet have a track record there.
Inference clusters and smaller training runs (under 512 GPUs): 400G Ethernet with RoCE is viable and significantly cheaper. The collective communication patterns are less sensitive to microsecond latency at smaller scale, and Ethernet's operational familiarity reduces deployment risk.
The fork in the road happens around 2027: if Ultra Ethernet delivers on its latency targets with production silicon and NCCL integration, the InfiniBand moat narrows significantly. NVIDIA's vertical integration advantage (GPUs + NVLink + InfiniBand + NCCL + CUDA) is powerful but not permanent if the interconnect layer becomes commoditized. Watch the UEC production deployment data carefully — that's where the real answer will come from.