
Nvidia’s New Rubin Architecture Thrives on Networking

by Dina Genkina, IEEE Spectrum (#72QPD)
[Image: Nvidia's Vera Rubin AI supercomputer, made up of several racks of CPUs and GPUs]

Earlier this week, Nvidia surprise-announced its new Vera Rubin architecture (no relation to the recently unveiled telescope) at the Consumer Electronics Show in Las Vegas. The new platform, set to reach customers later this year, is advertised to offer a tenfold reduction in inference costs and a fourfold reduction in the number of GPUs it takes to train certain models, compared with Nvidia's Blackwell architecture.

The usual suspect for improved performance is the GPU. Indeed, the new Rubin GPU boasts 50 quadrillion 4-bit floating-point operations per second (50 petaFLOPS), compared with 10 petaFLOPS for Blackwell, at least for transformer-based inference workloads like large language models.

However, focusing on just the GPU misses the bigger picture. There are a total of six new chips in the Vera-Rubin-based computers: the Vera CPU, the Rubin GPU, and four distinct networking chips. To achieve performance advantages, the components have to work in concert, says Gilad Shainer, senior vice president of networking at Nvidia.

"The same unit connected in a different way will deliver a completely different level of performance," Shainer says. "That's why we call it extreme co-design."

Expanded in-network compute

AI workloads, both training and inference, run on large numbers of GPUs simultaneously. "Two years back, inferencing was mainly run on a single GPU, a single box, a single server," Shainer says. "Right now, inferencing is becoming distributed, and it's not just in a rack. It's going to go across racks."

To accommodate these hugely distributed tasks, as many GPUs as possible need to effectively work as one. This is the aim of the so-called scale-up network: the connection of GPUs within a single rack. Nvidia handles this connection with its NVLink networking chips. The new line includes the NVLink6 switch, with double the bandwidth of the previous version: 3,600 gigabytes per second for GPU-to-GPU connections, compared with 1,800 GB/s for the NVLink5 switch.
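As a rough illustration of what that doubling means in practice, consider moving the weights of a large model between two GPUs. The model size and format below are hypothetical (a 70-billion-parameter model stored in 4-bit precision), not figures from Nvidia; only the two bandwidth numbers come from the article.

```python
# Back-of-the-envelope sketch (hypothetical model, real bandwidth figures):
# time to move ~35 GB of 4-bit weights between two GPUs over NVLink.

PARAMS = 70e9                  # hypothetical model: 70 billion parameters
BYTES_PER_PARAM = 0.5          # 4-bit weights = half a byte each
model_bytes = PARAMS * BYTES_PER_PARAM    # 35e9 bytes, i.e. 35 GB

NVLINK5_BPS = 1800e9           # bytes/s, NVLink5 switch (per the article)
NVLINK6_BPS = 3600e9           # bytes/s, NVLink6 switch (per the article)

t5 = model_bytes / NVLINK5_BPS  # ~19.4 ms
t6 = model_bytes / NVLINK6_BPS  # ~9.7 ms
print(f"NVLink5: {t5 * 1e3:.1f} ms, NVLink6: {t6 * 1e3:.1f} ms")
```

Halving a transfer that happens between every step of a tightly synchronized workload is where the bandwidth doubling pays off.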

In addition to the doubled bandwidth, the scale-up chips also include twice as many SerDes (serializer/deserializer) circuits, which allow data to be sent across fewer wires, and an expanded set of calculations that can be performed within the network itself.

"The scale-up network is not really the network itself," Shainer says. "It's computing infrastructure, and some of the computing operations are done on the network...on the switch."

The rationale for offloading some operations from the GPUs to the network is twofold. First, it allows some tasks to be done only once, rather than having every GPU perform them. A common example is the all-reduce operation in AI training. During training, each GPU computes a quantity called a gradient on its own batch of data. To train the model correctly, all the GPUs need to know the average gradient computed across all batches. Rather than each GPU sending its gradient to every other GPU, with every one of them computing the same average, it saves computational time and power for that operation to happen only once, within the network.
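The saving can be made concrete with a toy sketch. This is not Nvidia's implementation, just a minimal model of the all-reduce step described above, with made-up gradient values and a simple message count for each approach:

```python
import random

# Toy sketch of gradient averaging (the "all-reduce" step), not
# Nvidia's implementation. Each GPU holds a gradient for the same
# 4 parameters, computed on its own batch of data.
random.seed(0)
NUM_GPUS = 8
grads = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(NUM_GPUS)]

# In-network reduction: the switch sums the incoming gradients once
# and sends the single averaged result back to every GPU.
avg = [sum(g[i] for g in grads) / NUM_GPUS for i in range(4)]
result_per_gpu = [avg] * NUM_GPUS     # every GPU receives the same answer

# Naive alternative: every GPU sends its gradient to every other GPU,
# and each of the N GPUs redundantly computes the same average.
naive_messages = NUM_GPUS * (NUM_GPUS - 1)   # N*(N-1) point-to-point sends
switch_messages = 2 * NUM_GPUS               # N up to the switch, N back down
print(naive_messages, switch_messages)       # prints: 56 16
```

Even at 8 GPUs the switch-based version moves far fewer messages and does the averaging arithmetic once instead of 8 times; at rack scale the gap widens quadratically.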

A second rationale is to hide the time it takes to shuttle data between GPUs by doing computations on the data en route. Shainer explains this with an analogy of a pizza parlor trying to speed up delivery: "What can you do if you had more ovens or more workers? It doesn't help you; you can make more pizzas, but the time for a single pizza is going to stay the same. Alternatively, if you would take the oven and put it in a car, so I'm going to bake the pizza while traveling to you, this is where I save time. This is what we do."
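The oven-in-the-car analogy corresponds to a standard overlap trick: split the data into chunks and transfer the next chunk while computing on the current one. A toy timing model (all numbers hypothetical) shows why this hides communication time:

```python
# Toy timing model of overlapping transfer and compute (hypothetical numbers).
T_COMM = 4.0   # ms to move one chunk between GPUs
T_COMP = 3.0   # ms to compute on one chunk
CHUNKS = 8

# Sequential: move each chunk, then compute on it, one after another.
sequential = CHUNKS * (T_COMM + T_COMP)

# Overlapped pipeline: once the first chunk has arrived, every later
# transfer happens while the previous chunk is being processed, so the
# slower of the two stages sets the pace.
overlapped = T_COMM + (CHUNKS - 1) * max(T_COMM, T_COMP) + T_COMP

print(sequential, overlapped)   # prints: 56.0 35.0
```

In the limit of many chunks, the total time approaches whichever of communication or computation dominates, rather than their sum; the cheaper stage rides along "in the car."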

In-network computing is not new with this iteration of Nvidia's architecture; it has been in common use since around 2016. But this generation supports a broader swath of in-network computations, to accommodate different workloads and different numerical formats, Shainer says.

Scaling out and across

The rest of the networking chips included in the Rubin architecture comprise the so-called scale-out network. This is the part that connects different racks to each other within the data center.

Those chips are the ConnectX-9, a networking interface card; the BlueField-4, a so-called data processing unit, which is paired with two Vera CPUs and a ConnectX-9 card to offload networking, storage, and security tasks; and finally the Spectrum-6 Ethernet switch, which uses co-packaged optics to send data between racks. The Ethernet switch also doubles the bandwidth of the previous generation, while minimizing jitter, the variation in arrival times of information packets.

"Scale-out infrastructure needs to make sure that those GPUs can communicate well in order to run a distributed computing workload, and that means I need a network that has no jitter in it," he says. With jitter, different racks doing different parts of the calculation deliver their answers at different times. One rack will always be slower than the rest, and the rest of the racks, full of costly equipment, sit idle while waiting for that last packet. "Jitter means losing money," Shainer says.
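The straggler effect is easy to see in a toy model. In a synchronized step, the step finishes only when the slowest rack's result arrives, so any random delay on any rack stretches the whole step. All numbers here are hypothetical:

```python
import random

# Toy model of jitter as a straggler problem (hypothetical numbers):
# a synchronized step finishes only when the SLOWEST rack's packet
# arrives, so every other rack idles until then.
random.seed(1)
RACKS = 16
BASE_MS = 10.0   # nominal time for each rack to produce its partial result

def step_time(jitter_ms):
    """Arrival time of the last packet in one synchronized step."""
    arrivals = [BASE_MS + random.uniform(0, jitter_ms) for _ in range(RACKS)]
    return max(arrivals)

no_jitter = step_time(0.0)     # exactly BASE_MS: all packets arrive together
with_jitter = step_time(5.0)   # max of 16 delayed arrivals, well above BASE_MS
print(no_jitter, with_jitter)
```

The more racks participate, the closer the maximum delay creeps to the worst case, which is why a jitter-free network matters more, not less, as clusters grow.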

None of Nvidia's new chips is dedicated specifically to connecting data centers to each other, a layer termed "scale-across." But Shainer argues this is the next frontier. "It doesn't stop here, because we are seeing the demands to increase the number of GPUs in a data center," he says. "100,000 GPUs is not enough anymore for some workloads, and now we need to connect multiple data centers together."
