Nvidia Blackwell Ahead in AI Inference, AMD Second

In the latest round of machine learning benchmark results from MLCommons, computers built around Nvidia's new Blackwell GPU architecture outperformed all others. But AMD's latest spin on its Instinct GPUs, the MI325X, proved a match for the Nvidia H200, the product it was meant to counter. The comparable results came mostly on tests of one of the smaller-scale large language models, Llama2 70B (for 70 billion parameters). However, in an effort to keep up with a rapidly changing AI landscape, MLPerf added three new benchmarks to better reflect where machine learning is headed.
MLPerf benchmarks machine learning systems in an effort to provide an apples-to-apples comparison among computers. Submitters use their own software and hardware, but the underlying neural networks must be the same. There are now a total of 11 benchmarks for servers, with three added this year.
"It has been hard to keep up with the rapid development of the field," says Miro Hodak, the cochair of MLPerf Inference. ChatGPT appeared only in late 2022, OpenAI unveiled its first large language model (LLM) that can reason through tasks just last September, and LLMs have grown exponentially: GPT-3 had 175 billion parameters, while GPT-4 is thought to have nearly 2 trillion. "As a result of the breakneck innovation, we've increased the pace of getting new benchmarks into the field," says Hodak.
The new benchmarks include two LLMs. The popular and relatively compact Llama2 70B is already an established MLPerf benchmark, but the consortium wanted something that mimicked the responsiveness people expect of chatbots today. So the new benchmark, "Llama2-70B Interactive," tightens the requirements. Computers must produce at least 25 tokens per second under any circumstance and cannot take more than 450 milliseconds to begin an answer.
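The two interactive limits are independent constraints, and a submission must satisfy both at once. A minimal sketch of that check, with an illustrative function name that is not part of the actual MLPerf harness:

```python
def meets_interactive_slo(ttft_ms: float, tokens_per_second: float) -> bool:
    """Return True if a response satisfies both interactive limits
    described above: time to first token at most 450 ms, and a
    sustained generation rate of at least 25 tokens per second."""
    MAX_TTFT_MS = 450.0
    MIN_TOKENS_PER_S = 25.0
    return ttft_ms <= MAX_TTFT_MS and tokens_per_second >= MIN_TOKENS_PER_S

# A fast start with a healthy generation rate passes...
print(meets_interactive_slo(ttft_ms=300, tokens_per_second=40))
# ...but a sluggish first token fails, no matter the throughput.
print(meets_interactive_slo(ttft_ms=600, tokens_per_second=120))
```

Note that raising throughput cannot compensate for a slow first token; each limit must hold on its own.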
Seeing the rise of "agentic AI" (networks that can reason through complex tasks), MLPerf sought to test an LLM with some of the characteristics needed for that role. It chose Llama3.1 405B for the job. That LLM has what's called a wide context window: a measure of how much information (documents, samples of code, and so on) it can take in at once. For Llama3.1 405B, that's 128,000 tokens, more than 30 times as much as Llama2 70B.
The final new benchmark, RGAT, is what's called a graph attention network. It classifies information in a network. For example, the dataset used to test RGAT consists of scientific papers, which have relationships among authors, institutions, and fields of study, together making up 2 terabytes of data. RGAT must classify the papers into just under 3,000 topics.
Blackwell, Instinct Results

Nvidia continued its domination of MLPerf benchmarks through its own submissions and those of some 15 partners, such as Dell, Google, and Supermicro. Both its first- and second-generation Hopper architecture GPUs (the H100 and the memory-enhanced H200) made strong showings. "We were able to get another 60 percent performance over the last year" from Hopper, which went into production in 2022, says Dave Salvator, director of accelerated computing products at Nvidia. "It still has some headroom in terms of performance."
But it was Nvidia's Blackwell architecture GPU, the B200, that really dominated. "The only thing faster than Hopper is Blackwell," says Salvator. The B200 packs in 36 percent more high-bandwidth memory than the H200, but, even more important, it can perform key machine learning math using numbers with a precision as low as 4 bits, instead of the 8 bits Hopper pioneered. Lower-precision compute units are smaller, so more fit on each GPU, which leads to faster AI computing.
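The tradeoff behind lower-precision math is that fewer bits mean fewer representable values, so each number is rounded more coarsely. Blackwell's actual format is a 4-bit floating-point type; the toy sketch below instead uses simple symmetric integer quantization, purely to show how shrinking from 8 bits to 4 bits increases rounding error while halving storage:

```python
def quantize(values, bits):
    """Symmetrically map floats onto signed integers of `bits` width.
    (An illustration only; not Nvidia's FP4 floating-point format.)"""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

weights = [0.91, -0.42, 0.07, -0.88]      # made-up sample weights
q4, s4 = quantize(weights, bits=4)        # only 15 usable levels
q8, s8 = quantize(weights, bits=8)        # 255 usable levels

# Dequantize and measure the worst rounding error at each precision.
err4 = max(abs(w - q * s4) for w, q in zip(weights, q4))
err8 = max(abs(w - q * s8) for w, q in zip(weights, q8))
print(err4 > err8)  # 4-bit rounds more coarsely than 8-bit
```

Hardware makers bet that, with careful scaling, this extra rounding error barely hurts inference accuracy, while the smaller numbers double math throughput and cut memory traffic.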
In the Llama3.1 405B benchmark, an eight-B200 system from Supermicro delivered nearly four times the tokens per second of an eight-H200 system by Cisco. And the same Supermicro system was three times as fast as the quickest H200 computer at the interactive version of Llama2 70B.
Nvidia used its combination of Blackwell GPUs and Grace CPUs, called GB200, to demonstrate how well its NVL72 data links can integrate multiple servers in a rack, so they perform as if they were one giant GPU. In an unverified result the company shared with reporters, a full rack of GB200-based computers delivered 869,200 tokens per second on Llama2 70B. For comparison, the fastest system reported in this round of MLPerf was an Nvidia B200 server that delivered 98,443 tokens per second.
AMD is positioning its latest Instinct GPU, the MI325X, as providing performance competitive with Nvidia's H200. The MI325X has the same architecture as its predecessor, the MI300, but adds even more high-bandwidth memory (256 gigabytes, a 33 percent boost) and more memory bandwidth (6 terabytes per second, a 13 percent boost).
Adding more memory is a play to handle larger and larger LLMs. "Larger models are able to take advantage of these GPUs because the model can fit in a single GPU or a single server," says Mahesh Balasubramanian, director of data-center GPU marketing at AMD. "So you don't have to have that communication overhead of going from one GPU to another GPU or one server to another server. When you take out those communications, your latency improves quite a bit." AMD was able to take advantage of the extra memory through software optimization to boost the inference speed of DeepSeek-R1 eightfold.
On the Llama2 70B test, an eight-GPU MI325X computer came within 3 to 7 percent of the speed of a similarly tricked-out H200-based system. And on image generation, the MI325X system was within 10 percent of the Nvidia H200 computer.
AMD's other noteworthy mark this round came from its partner MangoBoost, which showed nearly fourfold performance on the Llama2 70B test by spreading the computation across four computers.
Intel has historically put forth CPU-only systems in the inference competition to show that for some workloads you don't really need a GPU. This time around saw the first data from Intel's Xeon 6 chips, which were formerly known as Granite Rapids and are made using Intel's 3-nanometer process. At 40,285 samples per second, the best image-recognition result for a dual-Xeon 6 computer was about one-third the performance of a Cisco computer with two Nvidia H100s.
Compared with Xeon 5 results from October 2024, the new CPU provides about an 80 percent boost on that benchmark and an even bigger boost on object detection and medical imaging. Since it first started submitting Xeon results in 2021 (with the Xeon 3), the company has achieved an elevenfold boost in performance on ResNet.
For now, it seems Intel has quit the field in the AI accelerator-chip battle. Its alternative to the Nvidia H100, Gaudi 3, did not make an appearance in the new MLPerf results, nor in version 4.1, released last October. Gaudi 3 got a later-than-planned release because its software was not ready. In the opening remarks at Intel Vision 2025, the company's invite-only customer conference, newly minted CEO Lip-Bu Tan seemed to apologize for Intel's AI efforts. "I'm not happy with our current position," he told attendees. "You're not happy either. I hear you loud and clear. We are working toward a competitive system. It won't happen overnight, but we will get there for you."
Google's TPU v6e chip also made a showing, though the results were restricted to the image-generation task. At 5.48 queries per second, the 4-TPU system delivered a 2.5-times boost over a similar computer using its predecessor, the TPU v5e, in the October 2024 results. Even so, 5.48 queries per second was roughly in line with a similarly sized Lenovo computer using Nvidia H100s.
This post was corrected on 2 April 2025 to give the right value for high-bandwidth memory in the MI325X.