Supercomputing’s Future Is Green and Interconnected

Dina Genkina

from IEEE Spectrum on 2024-03-20 16:58 (#6KG94)


While the Top500 list ranks the 500 biggest high-performance computers (HPCs) in the world, its cousin the Green500 reranks the same 500 supercomputers according to their energy efficiency. For the last three iterations of the list, Henri, a small supercomputer operated by the Flatiron Institute in New York City, has been named the world's most energy-efficient high-performance computer. Built in the fall of 2022, Henri was the first system to use Nvidia's H100 GPUs, a.k.a. Hopper.

To learn the secrets of building and maintaining the most energy-efficient supercomputer, we caught up with Henri's architect, Ian Fisk, who is codirector of the Scientific Computing Core at the Flatiron Institute. Flatiron is an internal research division of the Simons Foundation that brings together researchers using modern computational tools to advance our understanding of science.


IEEE Spectrum: Where did the name Henri come from?

Ian Fisk: The name came about for a silly reason. Our previous machine was called Rusty. So, when asked by the vendor what the machine name was going to be, we said, 'Well, by our naming convention, it'll be Rusty, and it's using [Nvidia's] H100 chip, so it'd be Rusty Hopper.' But Rusty Hopper sounds like a country singer from the 1980s, so they didn't want to call it that. And one of the Nvidia engineers who decided that you might be able to actually build a machine that would make the Top500 and be the top of the Green500 had just had a son named Henri. So, we were asked by the vendor if we might consider naming it after that person, which we thought was sweet.


Did you set out to build the world's greenest supercomputer?

Fisk: Nvidia sold us that gear at an educational-discount price in part because we were aiming for this benchmark. It was good for us because it gave us some exposure, but we really wanted the hardware for the scientists, and it was a way for us to get access to H100s very early. But to do that, we had to do the test in November 2022. So the equipment came to the loading dock in October, and it was assembled into a computer and then tested in record time. If there was an award for the fast 500, we would also be the winner.

My Trip to See Henri

A few weeks ago, I hopped on a train from the bustling Penn Station of New York City to the land of warehouses and outlet malls known as Secaucus, N.J., just 10 minutes away by NJ Transit. I was on my way to see Henri firsthand, the world's greenest supercomputer one and a half years running, according to the Green500 list. On the train over, my guide, Henri's architect and caregiver Ian Fisk, told me: "Prepare to be disappointed." Fisk is a former research physicist, a breed with which I am familiar, and among whom self-deprecation runs rampant. Knowing this, I did not heed his warning.

After a 10-minute drive in Fisk's self-proclaimed midlife-crisis vehicle, we arrived at an astonishingly cubic building and made our way to the security desk just past the front door. The building, Fisk explained, houses data centers for several financial institutions, so security is of the utmost importance. Henri itself is a less likely target for hackers: it is built and used by the Flatiron Institute, a research facility run by the Simons Foundation that focuses on computational biology, mathematics, quantum physics, neuroscience, and astrophysics. Yet I had to present ID and get fingerprinted before being allowed to follow Fisk to the data warehouse.

We walked into a huge, brightly lit room with about 10 rows of 12 or so racks each, filled floor to ceiling with computers. The fans were making so much noise I couldn't quite make out anything Fisk was saying, but I gathered that most of the rows were just sharing space with Henri. We walked over to the back two rows, with black racks distinct from the white ones out front. These were run by the Flatiron Institute. "Henri?" I mouthed. Fisk shook his head no. We walked into one of the rows until Fisk pointed out two undistinguished-looking racks filled with computers. "Henri," he nodded. Can't say I wasn't warned.

As we walked out of the room and into a lounge with more favorable acoustics, I was about to learn that Henri's comparatively unimpressive size is no coincidence. In our Q&A, Fisk explained how they achieved their greenest accolade, why they've been able to maintain the top spot, and what the future might hold for this and other supercomputers.

The numbers in the first test run [November 2022] were not as good as the second time [June 2023]. The second time, when there was a little bit more time to breathe, we upgraded the machine. It was bigger: it was 80 GPUs the first time and 144 the second time. It's 2.7 petaflops, which for two racks of equipment is a reasonable size. It's around 250 on the Top500 largest supercomputers list. And then No. 1 on the Green500 list.

Can you explain your design decisions when building Henri? Why Nvidia's H100s?

Fisk: Our experience with Nvidia, which goes all the way back to K40s, was that every generation was about two to three times faster than its predecessor. And that was certainly true of all the things that led up to it, like the V100 and the A100. It's about two and a half times better. We already had two racks of A100s, and when it came time to upgrade the facility, H100s were the thing to buy.

The H100s at the time were only available in the PCI-connected version; they didn't have the NV-link option yet. And they didn't have any water-cooled ones, so we were using air-cooled systems again. The GPUs before that machine and after have all been water-cooled systems, because they're just a little bit more efficient, and easier to operate because you can get rid of a lot more heat. But we chose it because we were expecting very nice performance numbers. And we got them, eventually. With Nvidia, the software and the hardware sort of come out at the same time. And the performance tends to get better over time as things get optimized properly.

The thing that separates a computer from a supercomputer is the low-latency fabric. And on almost all systems right now, that low-latency fabric is InfiniBand. The only company that provides it is Mellanox [Technologies], which was recently acquired by Nvidia Corp., so they own the whole stack.


There was one design choice that was sort of thrust upon us that we're revisiting right now. When we bought the system, the only chassis that you could buy were PCIe Gen 4, and the H100s use PCIe Gen 5. Because it was Gen 4, we were limited by the communication speed to the GPUs and to the InfiniBand cards. When we started, we had HDR cards at 100 gigabits each. And we rapidly discovered that that wasn't going to be sufficient to do a good test for the Green500. So, we upgraded to 400 gigabits of InfiniBand on each node, and that helped some. Had we had PCIe Gen 5, we could have had two times 400 gigabits, and that would have been even better.

What optimizations did you have to do for the Green500 test?

Fisk: I think doing the Green500 run is a little bit like being a hypermiler. You have a Honda Civic and you drive across the country getting 60 miles per gallon with the windows closed, AC off, and accelerating very slowly, but that's not exactly the way you'd drive it in a rush to get somewhere. For instance, when you do the Green500 run, everything that doesn't generate performance is turned down. There are big solid-state drives on all of the systems of this type when you're running in production, because you need to serve training samples to machine-learning applications. But they use power, and they don't give you any performance, so those get turned off. It's a little bit like a hypermiler taking the spare tire out of their car because they wanted to get better mileage, but it's not how they would actually drive it all the time.

How have you been able to keep the No. 1 spot for almost two years?

Fisk: Certainly, the thing that will knock Henri off its perch will be the next generation of hardware. But I think the thing that has allowed us to stand on top has been that technology has evolved to use more power rather than be more efficient. We didn't expect to win more than once. We were expecting that people would come along with the water-cooled version of H100s and be more efficient than us, but that hasn't happened so far.

The H100 comes in two models: the PCI version, which plugs into the board as a card, and the motherboard-mounted version, called SXM5. The SXM5 is the NV-linked version. The big difference is that the SXM5 has a communication protocol between the GPUs that allows them to talk to each other at 900 gigabytes per second. It's dramatically better than anything on InfiniBand. It's really what allows them to solve problems like large language models, because when you're doing these kinds of calculations, at each epoch, there can be a tremendous amount of information that has to flow back and forth. So those communication links are very important, but they also use more electricity. The LINPACK benchmark that they do for the Green500 test benefits from a good communication layer, but not at that level.

The reason why no one has beaten the machine yet is that the SXM5s actually use a lot more electricity: they use 700 watts per GPU while ours use only 350, and the performance [on things like the LINPACK test] is not a factor of 2 different. Since the Green500 measures performance per watt, it doesn't matter how fast you are, it matters how fast you are for how many watts you used. And that's the thing that we see with those PCI-connected H100s. They are very hard to beat because they don't use a lot of electricity and they have similar performance to the much higher wattage stuff on these kinds of calculations.
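The arithmetic behind this argument can be sketched in a few lines of Python. The per-GPU wattages are the ones Fisk quotes; the speedup values are hypothetical placeholders rather than measured results, and the function name `relative_efficiency` is ours. The sketch also looks only at GPU power, whereas a real Green500 measurement covers the whole system.

```python
# Performance-per-watt comparison, GPU power only.
# Wattages are from the interview; the speedups below are hypothetical.

PCIE_WATTS = 350.0   # per-GPU power of the PCIe H100, as quoted
SXM5_WATTS = 700.0   # per-GPU power of the SXM5 H100, as quoted


def relative_efficiency(speedup: float) -> float:
    """Return SXM5 performance per watt relative to PCIe (1.0 means a tie).

    `speedup` is SXM5 benchmark performance divided by PCIe performance.
    """
    power_ratio = SXM5_WATTS / PCIE_WATTS
    return speedup / power_ratio


if __name__ == "__main__":
    for speedup in (1.2, 1.5, 2.0):  # hypothetical speedups, not measurements
        eff = relative_efficiency(speedup)
        print(f"speedup {speedup:.1f}x -> {eff:.2f}x performance per watt")
```

Under these assumptions, the SXM5 part only ties on efficiency if it is fully twice as fast on the benchmark; anything less leaves the 350-watt card ahead.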

Do you expect to be the greenest supercomputer again in May?

Fisk: Well, we are building a new machine with 96 GPUs. These will be the SXM5s, water-cooled NV-linked devices. We will know soon if they will have better performance. As I mentioned, they may be faster, but they may not be more efficient. But one thing we found with our A100s was that most of the performance is available in the first half of the wattage, so you get 90 percent of the performance in the first 225 watts. So, one of the things that we're going to try with the water-cooled system is to run it in power-capped mode, and see what kind of performance we get.
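To see why power capping is attractive for an efficiency benchmark, here is a minimal sketch using only the two figures from the answer above (about 90 percent of the performance at about half the wattage). The function name `efficiency_gain` is an illustrative choice, and whether the new water-cooled GPUs behave like the A100s is exactly what remains to be tested.

```python
def efficiency_gain(perf_fraction: float, power_fraction: float) -> float:
    """Performance per watt under a power cap, relative to running uncapped.

    Both arguments are fractions of the uncapped value, e.g. 0.90 and 0.50.
    """
    if not 0 < power_fraction <= 1:
        raise ValueError("power_fraction must be in (0, 1]")
    return perf_fraction / power_fraction


if __name__ == "__main__":
    # Figures quoted for the A100s: ~90% of the performance at half the power.
    gain = efficiency_gain(perf_fraction=0.90, power_fraction=0.50)
    print(f"capped run delivers about {gain:.1f}x the performance per watt")
```

Giving up a tenth of the speed for half the power nearly doubles performance per watt, which is the quantity the Green500 ranks.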


One nice thing about the water-cooled version is that it doesn't need fans, because the fans count against your wattage. When these units are running, it's about 4 kilowatts of power per three units of space (3U). So it's like forty 100-watt lightbulbs in a small box. Cooling that down requires blowing a tremendous amount of air across it, so you can have a few hundred watts of fans. And with water cooling, you just have a central pump, which means significant savings. The heat capacity of water is about 4,000 times the heat capacity of air by volume, so you have to use a lot less of it.
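The water-versus-air comparison can be checked with rounded textbook values for density and specific heat at room temperature. The sketch below is a rough estimate, not a cooling design: the 10-kelvin coolant temperature rise is a hypothetical value chosen for illustration, and the 4-kilowatt load is the figure quoted above.

```python
# Rounded room-temperature properties (approximate textbook values).
WATER_DENSITY = 1000.0  # kg/m^3
WATER_CP = 4186.0       # J/(kg*K)
AIR_DENSITY = 1.2       # kg/m^3, near 20 C at sea-level pressure
AIR_CP = 1005.0         # J/(kg*K)


def volumetric_heat_capacity(density: float, specific_heat: float) -> float:
    """Heat absorbed per cubic meter per kelvin, in J/(m^3*K)."""
    return density * specific_heat


def flow_needed(power_watts: float, vol_heat_capacity: float, delta_t: float) -> float:
    """Coolant flow in m^3/s needed to carry away power_watts at a delta_t rise."""
    return power_watts / (vol_heat_capacity * delta_t)


water = volumetric_heat_capacity(WATER_DENSITY, WATER_CP)
air = volumetric_heat_capacity(AIR_DENSITY, AIR_CP)
ratio = water / air

if __name__ == "__main__":
    print(f"water stores about {ratio:,.0f} times more heat per unit volume than air")
    delta_t = 10.0  # hypothetical coolant temperature rise, in kelvin
    print(f"air flow for 4 kW:   {flow_needed(4000.0, air, delta_t):.3f} m^3/s")
    print(f"water flow for 4 kW: {flow_needed(4000.0, water, delta_t) * 1000:.3f} L/s")
```

With these rounded inputs the ratio comes out near 3,500, the same order of magnitude as the "about 4,000" quoted in conversation.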

It's going to be interesting to see the next Green500 list in May of this year. We'll see who comes along and whether nobody beats us, or somebody beats us, or we beat ourselves. It's all possible.

What does the future look like for Henri or its successor?

Fisk: The future is going to be expensive. And the future is going to be very high powered.

When we started, the GPU was a specialized resource that was very good for machine learning and certain kinds of linear algebra calculations. At the beginning, everyone used a single GPU. Then they started using them together in groups where they would fit their computation across several nodes, up to eight nodes. Now, we're seeing more and more people who want to do tightly connected large language models, where it requires 100 GPUs or several hundreds of GPUs connected in ways that we never would have imagined.

For the next set of resources we're buying, the network connectivity is 16 times better than the ones that came before that. It's a similar set of equipment, but these ones have 1.6 terabits of communication per node, as compared to 100 gigabits. And it makes the machines very expensive, because suddenly the network fabric is a large factor in the purchase price, because you need lots and lots of InfiniBand switches and lots of cables. And these are 800-gigabit cables: exotic and very high performance.
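The bandwidth figures quoted in this interview mix gigabits and gigabytes, so it can help to put them on one scale. The sketch below converts everything to bits per second and compares each link with the original 100-gigabit cards. The dictionary labels are ours, and the comparison is loose: the InfiniBand numbers are per node, while the 900-gigabyte-per-second figure is the GPU-to-GPU link Fisk describes for the SXM5.

```python
GIGABIT = 10**9  # bits per second in one gigabit per second

# All figures as quoted in the interview, converted to bits per second.
links_bits_per_s = {
    "original InfiniBand (per node)": 100 * GIGABIT,
    "upgraded InfiniBand (per node)": 400 * GIGABIT,
    "next machine (per node)": 1600 * GIGABIT,        # 1.6 terabits
    "SXM5 GPU-to-GPU link": 900 * 8 * GIGABIT,        # 900 gigabytes -> bits
}

BASELINE = links_bits_per_s["original InfiniBand (per node)"]


def relative_bandwidth(name: str) -> float:
    """Bandwidth of the named link as a multiple of the original 100 Gb/s."""
    return links_bits_per_s[name] / BASELINE


if __name__ == "__main__":
    for name in links_bits_per_s:
        print(f"{name}: {relative_bandwidth(name):.0f}x the original link")
```

The factor of 16 for the next machine matches the figure given above, and the byte-to-bit conversion shows why the GPU-to-GPU link is described as dramatically better than anything on InfiniBand.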


We expect there'll be lots of people who are running conventional high-performance computing codes. But now there's this new community that wants to use big chunks of very valuable resources, and we're trying to support those people. It's complicated, in part because we are competing with industries that do this, too. These kinds of resources are very hard to buy. They have long lead times, they're very expensive, in part because it's driven by the AI gold rush that is going on right now. We're trying to figure out our place in that, and so we're buying a medium-scale machine. And we don't know what happens after that.

What is the Flatiron Institute using Henri for?

Fisk: It's a mix. I would say, still 75 or 80 percent is what I would consider canned machine-learning applications. This is PyTorch primarily, where people are building models to make either simulation or prediction of various things, finding correlations. This runs across the whole spectrum. We've got people who are looking at how to understand the AI and build better models. We also have people who are working on things like structural systems biology, looking for correlations of microbiome in the gut. We have people working on protein structure, gene function, looking at gene sequences, and using machine-learning techniques to identify what's going on.

The most recent project is called Polymathic AI. A simplistic summary would be something like ChatGPT for science. The idea is to make a large enough foundation model for science, where you teach the AI algorithms a lot about physical processes, and then ask them to do things like fluid dynamics simulations. It's a very ambitious project. And they're trying to figure out how to get bigger, how to scale up their work. And the idea behind this is that with tightly connected GPUs you can get models that have 10 to the power of 10 parameters. And this is what's really driving that particular technology.

Henri is a workhorse machine. If you go into the queue right now, it's entirely full. If I wanted to run another Green500 test and say: "I'm going to take this thing offline for two weeks," I would have a riot on my hands. There would be pitchforks outside my office. So yes, it's a very green, efficient computer. But at the end of the day, its legacy is all of the amazing science it enables.
