
Newest Google and Nvidia Chips Speed AI Training

by Samuel K. Moore
from IEEE Spectrum
[Image: a stack of computer servers with fan-exhaust ports]

Nvidia, Oracle, Google, Dell, and 13 other companies reported how long it takes their computers to train the key neural networks in use today. Among those results was the first glimpse of Nvidia's next-generation GPU, the B200, and Google's upcoming accelerator, called Trillium. The B200 posted a doubling of performance on some tests versus today's workhorse Nvidia chip, the H100. And Trillium delivered nearly a four-fold boost over the chip Google tested in 2023.

The benchmark tests, called MLPerf v4.1, consist of six tasks: recommendation, the pre-training of the large language models (LLMs) GPT-3 and BERT-large, the fine-tuning of the Llama 2 70B large language model, object detection, graph node classification, and image generation.

Training GPT-3 is such a mammoth task that it'd be impractical to do the whole thing just to deliver a benchmark. Instead, the test trains it only to a checkpoint that experts have determined indicates the model would likely reach the accuracy goal if training continued. For Llama 2 70B, the goal is not to train the LLM from scratch, but to take an already trained model and fine-tune it so it's specialized in a particular expertise, in this case, government documents. Graph node classification is a type of machine learning used in fraud detection and drug discovery.

As what's important in AI has evolved, mostly toward using generative AI, the set of tests has changed. This latest version of MLPerf marks a complete changeover in what's being tested since the benchmark effort began. "At this point all of the original benchmarks have been phased out," says David Kanter, who leads the benchmark effort at MLCommons. In the previous round, some of the benchmarks were taking mere seconds to complete.

[Chart] Performance of the best machine learning systems on various benchmarks has outpaced what would be expected if gains were solely from Moore's Law [blue line]. Solid lines represent current benchmarks. Dashed lines represent benchmarks that have now been retired, because they are no longer industrially relevant. MLCommons

According to MLPerf's calculations, AI training on the new suite of benchmarks is improving at about twice the rate one would expect from Moore's Law. As the years have gone on, results have plateaued more quickly than they did at the start of MLPerf's existence. Kanter attributes this mostly to the fact that companies have figured out how to do the benchmark tests on very large systems. Over time, Nvidia, Google, and others have developed software and network technology that allows for near-linear scaling: doubling the number of processors cuts training time roughly in half.
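The near-linear scaling described above can be sketched with a toy model. All numbers here are illustrative, not MLPerf results, and the scaling-efficiency parameter is an assumption:

```python
import math

# Toy model of near-linear scaling: with perfect scaling, doubling the
# number of processors halves training time. Real systems lose a little
# to communication and synchronization, modeled here as an efficiency
# factor slightly below 1. Illustrative only, not MLPerf data.

def training_time(base_time, base_chips, chips, efficiency=0.95):
    """Estimate training time when scaling from base_chips to chips.

    Each doubling of chips yields a (2 * efficiency)x speedup, so
    efficiency=1.0 is ideal linear scaling.
    """
    doublings = math.log2(chips / base_chips)
    speedup = (2 * efficiency) ** doublings
    return base_time / speedup

# Doubling chips at 95% efficiency cuts time to ~52.6%, not exactly 50%.
t_2x = training_time(60.0, 1024, 2048)  # ~31.6 minutes
t_4x = training_time(60.0, 1024, 4096)  # ~16.6 minutes
```

At `efficiency=1.0` the model reduces to exact halving per doubling, which is the ideal the article describes the big systems as approaching.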

[Scatter visualization: public.flourish.studio/visualisation/20196...]

First Nvidia Blackwell training results

This round marked the first training tests for Nvidia's next GPU architecture, called Blackwell. For the GPT-3 training and LLM fine-tuning tasks, Blackwell (the B200) roughly doubled the performance of the H100 on a per-GPU basis. The gains were a little less robust but still substantial for recommender systems and image generation: 64 percent and 62 percent, respectively.

The Blackwell architecture, embodied in the Nvidia B200 GPU, continues an ongoing trend toward using less and less precise numbers to speed up AI. For certain parts of transformer neural networks such as ChatGPT, Llama 2, and Stable Diffusion, the Nvidia H100 and H200 use 8-bit floating-point numbers. The B200 brings that down to just 4 bits.

Google debuts 6th gen hardware

Google showed the first results for its 6th generation of TPU, called Trillium, which it unveiled only last month, and a second round of results for its 5th generation variant, the Cloud TPU v5p. In the 2023 edition, the search giant entered a different variant of the 5th generation TPU, v5e, designed more for efficiency than performance. Versus that v5e hardware, Trillium delivers as much as a 3.8-fold performance boost on the GPT-3 training task.

But versus everyone's arch-rival Nvidia, things weren't as rosy. A system made up of 6,144 TPU v5ps reached the GPT-3 training checkpoint in 11.77 minutes, placing a distant second to an 11,616-GPU Nvidia H100 system, which accomplished the task in about 3.44 minutes. That top TPU system was only about 25 seconds faster than an H100 computer half its size.
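One rough way to compare systems of different sizes is total chip-minutes, chips multiplied by training time. This assumes near-linear scaling and is only a back-of-the-envelope normalization, not an official MLPerf metric:

```python
# Back-of-the-envelope per-chip comparison using chip-minutes
# (chips x minutes). Valid only to the extent scaling is near linear.

def chip_minutes(chips, minutes):
    return chips * minutes

tpu_v5p = chip_minutes(6144, 11.77)   # ~72,315 chip-minutes
h100 = chip_minutes(11616, 3.44)      # ~39,959 chip-minutes

ratio = tpu_v5p / h100                # ~1.8: the H100 system used
                                      # roughly 1.8x fewer chip-minutes
```

By this crude measure, the H100 entry needed a bit more than half the chip-minutes of the TPU v5p entry to reach the same checkpoint, consistent with the article's head-to-head observation.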

A Dell Technologies computer fine-tuned the Llama 2 70B large language model using about 75 cents worth of electricity.

In the closest head-to-head comparison between v5p and Trillium, with each system made up of 2048 TPUs, the upcoming Trillium shaved a solid 2 minutes off of the GPT-3 training time, nearly an 8 percent improvement on v5p's 29.6 minutes. Another difference between the Trillium and v5p entries is that Trillium is paired with AMD Epyc CPUs instead of the v5p's Intel Xeons.

Google also trained the image generator Stable Diffusion with the Cloud TPU v5p. At 2.6 billion parameters, Stable Diffusion is a light enough lift that MLPerf contestants are asked to train it to convergence instead of just to a checkpoint, as with GPT-3. A 1,024-TPU system ranked second, finishing the job in 2 minutes 26 seconds, about a minute behind the same-size system made up of Nvidia H100s.

[Chart: public.flourish.studio/visualisation/20251...]

Training power is still opaque

The steep energy cost of training neural networks has long been a source of concern, and MLPerf is only beginning to measure it. Dell Technologies was the sole entrant in the energy category, with an eight-server system containing 64 Nvidia H100 GPUs and 16 Intel Xeon Platinum CPUs. The only measurement made was for the LLM fine-tuning task (Llama 2 70B). The system consumed 16.4 megajoules during its roughly 5-minute run, for an average power draw of about 55 kilowatts. That works out to about 75 cents' worth of electricity at average US rates.
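The energy figure converts to average power and cost with straightforward arithmetic. A quick sketch, where the electricity rate of about 16 cents per kilowatt-hour is an assumption rather than a number from the results:

```python
# Convert the Dell entry's reported energy (16.4 MJ over a ~5-minute
# run) into average power and electricity cost.

energy_j = 16.4e6                # joules consumed
run_s = 5 * 60                   # run length in seconds (approximate)

avg_power_kw = energy_j / run_s / 1e3   # ~54.7 kW average draw

energy_kwh = energy_j / 3.6e6           # 1 kWh = 3.6 MJ -> ~4.56 kWh
rate_usd_per_kwh = 0.16                 # assumed average US retail rate
cost_usd = energy_kwh * rate_usd_per_kwh  # ~$0.73, i.e. about 75 cents
```

Note that ~55 kW across 64 H100 GPUs comes to roughly 850 watts per GPU including CPUs, networking, and cooling overhead, which is a plausible full-system figure.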

While it doesn't say much on its own, the result does potentially provide a ballpark for the power consumption of similar systems. Oracle, for example, reported a close performance result, 4 minutes 45 seconds, using the same number and types of CPUs and GPUs.
