
AI Inference Competition Heats Up

by Dina Genkina, IEEE Spectrum

While the dominance of Nvidia GPUs for AI training remains undisputed, we may be seeing early signs that, for AI inference, the competition is gaining on the tech giant, particularly in terms of power efficiency. The sheer performance of Nvidia's new Blackwell chip, however, may be hard to beat.

This morning, ML Commons released the results of its latest AI inferencing competition, ML Perf Inference v4.1. This round included first-time submissions from teams using AMD Instinct accelerators, the latest Google Trillium accelerators, chips from Toronto-based startup UntetherAI, as well as a first trial for Nvidia's new Blackwell chip. Two other companies, Cerebras and FuriosaAI, announced new inference chips but did not submit to MLPerf.

Much like an Olympic sport, MLPerf has many categories and subcategories. The one that drew the largest number of submissions was the "datacenter-closed" category. The closed category (as opposed to open) requires submitters to run inference on a given model as-is, without significant software modification. The data center category tests submitters on bulk processing of queries, whereas the edge category focuses on minimizing latency.

Within each category there are nine benchmarks, covering different types of AI tasks. These include popular use cases such as image generation (think Midjourney) and LLM Q&A (think ChatGPT), as well as equally important but less heralded tasks such as image classification, object detection, and recommendation engines.

This round of the competition included a new benchmark, called Mixture of Experts. This reflects a growing trend in LLM deployment, in which a language model is broken up into several smaller, independent language models, each fine-tuned for a particular task, such as regular conversation, solving math problems, and assisting with coding. The model can direct each query to an appropriate subset of the smaller models, or "experts." This approach allows for less resource use per query, enabling lower cost and higher throughput, says Miroslav Hodak, MLPerf Inference Workgroup Chair and senior member of technical staff at AMD.
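To make the routing idea concrete, here is a minimal, purely illustrative Python sketch. Real mixture-of-experts models use a learned gating network that routes individual tokens to expert sub-networks inside the model; the expert names and keyword scoring below are hypothetical.

```python
# Toy sketch of mixture-of-experts routing: only the top-scoring
# expert(s) handle a query, so the rest stay idle and cost nothing.
# Expert names and the keyword-based gate are illustrative only.

def score_experts(query: str) -> dict[str, int]:
    """Toy gating function: score each expert's fit for the query."""
    keywords = {
        "math": ["integral", "solve", "equation"],
        "code": ["python", "function", "bug"],
        "chat": [],  # fallback expert for general conversation
    }
    return {
        name: sum(word in query.lower() for word in words)
        for name, words in keywords.items()
    }

def route(query: str, top_k: int = 1) -> list[str]:
    """Send the query only to the top-k scoring experts."""
    scores = score_experts(query)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

print(route("Can you solve this equation for x?"))  # ['math']
```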

The winners on each benchmark within the popular datacenter-closed category were still submissions based on Nvidia's H200 GPUs and GH200 superchips, which combine GPUs and CPUs in the same package. However, a closer look at the performance results paints a more complex picture. Some of the submitters used many accelerator chips while others used just one. If we normalize the number of queries per second each submitter was able to handle by the number of accelerators used, and keep only the best performing submission for each accelerator type, some interesting details emerge. (It's important to note that this approach ignores the role of CPUs and interconnects.)
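As a rough sketch of that normalization, assuming a small set of hypothetical submissions (the entries below are placeholders, not actual MLPerf results):

```python
# Sketch of the per-accelerator normalization described above: divide each
# submission's queries per second by its accelerator count, then keep the
# best normalized result per accelerator type.
from collections import defaultdict

submissions = [
    # (accelerator type, number of accelerators, total queries/second)
    ("H200", 8, 32000.0),
    ("H200", 1, 4300.0),
    ("ExampleChip", 4, 9000.0),
]

best = defaultdict(float)
for accel, count, qps in submissions:
    per_accel = qps / count            # normalize by accelerator count
    best[accel] = max(best[accel], per_accel)

for accel, qps in best.items():
    print(f"{accel}: {qps:.1f} queries/s per accelerator")
```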

On a per-accelerator basis, Nvidia's Blackwell outperforms all previous chip iterations by 2.5x on the LLM Q&A task, the only benchmark it was submitted to. Untether AI's speedAI240 Preview chip performed almost on par with the H200 in its only submission task, image recognition. Google's Trillium performed just over half as well as the H100 and H200 on image generation, and AMD's Instinct performed roughly on par with the H100 on the LLM Q&A task.

The power of Blackwell

One of the reasons for Nvidia Blackwell's success is its ability to run the LLM using 4-bit floating-point precision. Nvidia and its rivals have been driving down the number of bits used to represent data in portions of transformer models like ChatGPT to speed computation. Nvidia introduced 8-bit math with the H100, and this submission marks the first demonstration of 4-bit math on MLPerf benchmarks.
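As a rough illustration of the general technique, and not Nvidia's actual implementation, the sketch below snaps weights to the representable values of a common 4-bit floating-point layout (E2M1), using a per-block scale to preserve dynamic range:

```python
# Illustrative 4-bit floating-point quantization sketch (not Nvidia's
# method): each weight is snapped to the nearest representable E2M1
# value after scaling the block so its largest weight maps to +/-6.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])  # add negative values

def quantize_fp4(weights: np.ndarray) -> np.ndarray:
    """Quantize a block of weights to FP4, returning the dequantized values."""
    scale = np.abs(weights).max() / 6.0                       # per-block scale
    scaled = weights / scale
    idx = np.abs(scaled[:, None] - FP4_GRID).argmin(axis=1)   # nearest grid point
    return FP4_GRID[idx] * scale

w = np.random.randn(16).astype(np.float32)
print(np.abs(w - quantize_fp4(w)).max())  # quantization error for this block
```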

The greatest challenge with using such low-precision numbers is maintaining accuracy, says Nvidia's product marketing director Dave Salvator. To maintain the high accuracy required for MLPerf submissions, the Nvidia team had to innovate significantly on software, he says.

Another important contributor to Blackwell's success is its nearly doubled memory bandwidth: 8 terabytes per second, compared with the H200's 4.8 terabytes per second.

[Image: Nvidia GB200 Grace Blackwell Superchip. Credit: Nvidia]

Nvidia's Blackwell submission used a single chip, but Salvator says it's built to network and scale, and will perform best when combined with Nvidia's NVLink interconnects. Blackwell GPUs support up to 18 NVLink 100 gigabyte-per-second connections for a total bandwidth of 1.8 terabytes per second, roughly double the interconnect bandwidth of H100s.

Salvator argues that with the increasing size of large language models, even inferencing will require multi-GPU platforms to keep up with demand, and Blackwell is built for this eventuality. "Blackwell is a platform," Salvator says.

Nvidia submitted its Blackwell-based system in the preview subcategory, meaning the hardware is not yet for sale but is expected to be available before the next MLPerf release, six months from now.

Untether AI shines in power use and at the edge

For each benchmark, MLPerf also includes an energy measurement counterpart, which systematically tests the wall-plug power that each system draws while performing a task. The main event (the datacenter-closed energy category) saw only two submitters this round: Nvidia and Untether AI. While Nvidia competed in all the benchmarks, Untether only submitted for image recognition.

Submitter  | Accelerator                | Number of accelerators | Queries per second | Watts    | Queries per second per watt
NVIDIA     | NVIDIA H200-SXM-141GB      | 8                      | 480,131.00         | 5,013.79 | 95.76
UntetherAI | UntetherAI speedAI240 Slim | 6                      | 309,752.00         | 985.52   | 314.30
The startup was able to achieve this impressive efficiency by building chips with an approach it calls at-memory computing. UntetherAI's chips are built as a grid of memory elements with small processors interspersed directly adjacent to them. The processors are parallelized, each working simultaneously with the data in the nearby memory units, thus greatly decreasing the amount of time and energy spent shuttling model data between memory and compute cores.

"What we saw was that 90 percent of the energy to do an AI workload is just moving the data from DRAM onto the cache to the processing element," says Untether AI vice president of product Robert Beachler. "So what Untether did was turn that around ... Rather than moving the data to the compute, I'm going to move the compute to the data."

This approach proved particularly successful in another subcategory of MLPerf: edge-closed. This category is geared toward more on-the-ground use cases, such as machine inspection on the factory floor, guided vision robotics, and autonomous vehicles, all applications where low energy use and fast processing are paramount, Beachler says.

Submitter  | GPU type                      | Number of GPUs | Single-stream latency (ms) | Multi-stream latency (ms) | Samples/s
Lenovo     | NVIDIA L4                     | 2              | 0.39                       | 0.75                      | 25,600.00
Lenovo     | NVIDIA L40S                   | 2              | 0.33                       | 0.53                      | 86,304.60
UntetherAI | UntetherAI speedAI240 Preview | 2              | 0.12                       | 0.21                      | 140,625.00

On the image recognition task, again the only one UntetherAI reported results for, the speedAI240 Preview chip beat the NVIDIA L40S's latency performance by 2.8x and its throughput (samples per second) by 1.6x. The startup also submitted power results in this category, but its Nvidia-accelerated competitors did not, so a direct comparison is difficult. However, the nominal power draw per chip is 150 watts for UntetherAI's speedAI240 Preview and 350 watts for Nvidia's L40S, a nominal 2.3x power advantage for UntetherAI along with improved latency.
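Those ratios can be checked against the edge-closed table above and the nominal power figures quoted here:

```python
# Working through the comparison above using the edge-closed table
# and the nominal per-chip power figures quoted in the text.
l40s  = {"latency_ms": 0.33, "samples_per_s": 86_304.60,  "nominal_watts": 350}
ut240 = {"latency_ms": 0.12, "samples_per_s": 140_625.00, "nominal_watts": 150}

print(f"latency advantage:    {l40s['latency_ms'] / ut240['latency_ms']:.1f}x")        # ~2.8x
print(f"throughput advantage: {ut240['samples_per_s'] / l40s['samples_per_s']:.1f}x")  # ~1.6x
print(f"nominal power ratio:  {l40s['nominal_watts'] / ut240['nominal_watts']:.1f}x")  # ~2.3x
```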

Cerebras, Furiosa skip MLPerf but announce new chips

[Image: Furiosa's new chip implements the basic mathematical function of AI inference, matrix multiplication, in a different, more efficient way. Credit: Furiosa]

Yesterday at the IEEE Hot Chips conference at Stanford, Cerebras unveiled its own inference service. The Sunnyvale, Calif.-based company makes giant chips, as big as a silicon wafer will allow, thereby avoiding interconnects between chips and vastly increasing its devices' memory bandwidth; they are mostly used to train massive neural networks. Now it has upgraded its software stack to use its latest computer, the CS-3, for inference.

Although Cerebras did not submit to MLPerf, the company claims its platform beats an H100 by 7x and competing AI startup Groq's chip by 2x in LLM tokens generated per second. "Today we're in the dial-up era of Gen AI," says Cerebras CEO and cofounder Andrew Feldman. "And this is because there's a memory bandwidth barrier. Whether it's an H100 from Nvidia or MI300 or TPU, they all use the same off-chip memory, and it produces the same limitation. We break through this, and we do it because we're wafer-scale."

Hot Chips also saw an announcement from Seoul-based Furiosa, which presented its second-generation chip, RNGD (pronounced "renegade"). What differentiates Furiosa's chip is its Tensor Contraction Processor (TCP) architecture. The basic operation in AI workloads is matrix multiplication, normally implemented as a primitive in hardware. However, the size and shape of the matrices, more generally known as tensors, can vary widely. RNGD instead implements multiplication of this more generalized version, tensors, as a primitive. "During inference, batch sizes vary widely, so it's important to utilize the inherent parallelism and data re-use from a given tensor shape," Furiosa founder and CEO June Paik said at Hot Chips.
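Conceptually, tensor contraction generalizes matrix multiplication, as the NumPy sketch below illustrates (this is plain einsum notation, not Furiosa's programming interface):

```python
# Conceptual illustration of tensor contraction as a generalization of
# matrix multiplication, the idea behind a tensor-contraction primitive.
import numpy as np

A = np.random.randn(4, 8)
B = np.random.randn(8, 5)
matmul = np.einsum("ik,kj->ij", A, B)   # ordinary matrix multiply as a contraction

# The same contraction machinery handles higher-rank tensors, e.g. a batched
# attention-style product, without reshaping everything into fixed matrices.
Q = np.random.randn(2, 16, 64)          # (batch, tokens, features)
K = np.random.randn(2, 16, 64)
scores = np.einsum("btf,bsf->bts", Q, K)
print(matmul.shape, scores.shape)        # (4, 5) (2, 16, 16)
```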

Although it didn't submit to MLPerf, Furiosa compared the performance of its RNGD chip on MLPerf's LLM summarization benchmark in-house. It performed on par with Nvidia's edge-oriented L40S chip while using only 185 watts of power, compared with the L40S's 320 watts. And, Paik says, the performance will improve with further software optimizations.

IBM also announced its new Spyre chip, designed for enterprise generative AI workloads and slated to become available in the first quarter of 2025.

At the very least, shoppers in the AI inference chip market won't be bored for the foreseeable future.
