Article 76H4D Intel And AMD's New ACE CPU Extensions Bring An Efficient AI-Oriented Instruction Set To X86

Intel And AMD's New ACE CPU Extensions Bring An Efficient AI-Oriented Instruction Set To X86

by
janrinok
from SoylentNews on (#76H4D)

Arthur T Knackerbracket writes:

https://www.tomshardware.com/pc-components/cpus/intel-and-amds-new-ace-cpu-extensions-bring-an-efficient-ai-oriented-instruction-set-to-x86-a-new-design-makes-matrix-multiplication-more-power-and-density-efficient

ACE comes in by offering a technical standard [.PDF] that leverages the existing AVX10 registers but adds silicon dedicated to matrix multiplication. This brings multiple benefits, but the key advantages are better power efficiency, easier development and optimization, and leveraging AVX's 512-bit inputs. The latter makes for easy integration with existing designs by eschewing the need for ACE-specific inputs.

For the same number of input vectors, ACE can perform 16x as many operations, compared to AVX10. Note this doesn't necessarily mean a 16x speedup, as that will depend on each individual implementation, but it's reasonable to expect that Intel and AMD will dedicate more silicon to this task in future designs to improve performance. Plus, as each ACE instruction performs more work than its equivalent AVX10 loop, there's less CPU instruction overhead and potentially better RAM bandwidth usage right off the bat.

The benefits go far beyond just using fewer instructions for the same thing. ACE is intended to be implementation-agnostic, meaning that ML frameworks and their underlying libraries (PyTorch, TensorFlow) can just write one code path instead of having multiple variations depending on the underlying hardware and its degree of AVX support.

ACE native supports most every data type used in ML operations (including but not limited to INT8, INT32, FP8, FP16, FP32, BF16), but it also can use Open Compute Project's MX block-scaled formats natively, something that AVX10 does not provide. Developers will also be able to move some NPU-specific workloads back to CPU when they need something done now and fast. In those situations, not having to deal with the fact that each NPU is different is a huge boon, too, as ACE offers a consistent target across x86 hardware.

Original Submission

Read more of this story at SoylentNews.

External Content
Source RSS or Atom Feed
Feed Location https://soylentnews.org/index.rss
Feed Title SoylentNews
Feed Link https://soylentnews.org/
Feed Copyright Copyright 2014, SoylentNews
Reply 0 comments