InfiniBand for AI Clusters: Why Ethernet Isn't Enough (Yet)
Every major AI training cluster runs InfiniBand. The $5 trillion question is whether Ultra Ethernet will change that. The honest answer: not yet, and here's exactly where the gap is.
A 400G coherent port draws 16–20 W. At 64 ports per chassis, that is roughly 1.0–1.3 kW just for optics. DSP power is a real constraint in dense deployments, and most capacity plans ignore it.
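The arithmetic behind the chassis figure can be sketched in a few lines. This is only an illustration using the per-port range and port count quoted above; the function name `optics_power_watts` is hypothetical, and real draw depends on the module and operating conditions.

```python
def optics_power_watts(ports: int, watts_per_port: float) -> float:
    """Total optics power for a chassis, assuming every port draws the same."""
    return ports * watts_per_port


PORTS = 64                 # ports per chassis, from the text
low = optics_power_watts(PORTS, 16.0)   # lower bound of the quoted per-port draw
high = optics_power_watts(PORTS, 20.0)  # upper bound of the quoted per-port draw

print(f"{PORTS} ports: {low / 1000:.2f}-{high / 1000:.2f} kW for optics")
```

The upper bound of the per-port range is what yields the 1.3 kW figure; a capacity plan would likely want both ends of the range.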
An OTDR finds breaks, splices, and macrobends. It tells you nothing about the cause of 80% of optical link failures: contaminated connectors. Here's what actual fiber testing looks like — and how an SFP-based microOTDR changes the economics.
Most modern high-speed pluggable optical modules contain a DSP that re-clocks and reshapes the electrical signal. That DSP can consume a significant part of the module power budget. For
Your data center runs web applications. Your procurement team buys optics for web application traffic. Then someone installs a GPU cluster, and the optics team discovers that AI fabric traffic
Co-Packaged Optics will cut power consumption in half. CPO will eliminate transceiver inventory headaches. CPO will redefine data center design. You have heard the pitch. The pitch skips the parts
Optical components are often compared based on unit price. This does not reflect the total cost. The most cost-effective component is the one that integrates without requiring ongoing effort to keep it stable.
Lab validation confirms that a configuration can work under ideal conditions. It does not guarantee that it will behave the same way in production. Understanding that limitation is critical.
The design is complete. Procurement comes back: 400ZR+ optics are not available for four to five months. That is not a supply chain problem. It is a design assumption problem.
A $350 optic turned into an $18,000 problem. Not because the optic failed — because nobody cleaned the connector.