Intel And AMD's New ACE CPU Extensions Bring An Efficient AI-Oriented Instruction Set To X86
Arthur T Knackerbracket writes:
ACE is a technical standard [.PDF] that reuses the existing AVX10 registers but adds silicon dedicated to matrix multiplication. This brings multiple benefits, but the key advantages are better power efficiency, easier development and optimization, and reuse of AVX's 512-bit inputs. The last of these makes for easy integration with existing designs, since no ACE-specific inputs are needed.
For the same number of input vectors, ACE can perform 16x as many operations as AVX10. Note that this doesn't necessarily mean a 16x speedup, as that will depend on each individual implementation, but it's reasonable to expect that Intel and AMD will dedicate more silicon to this task in future designs to improve performance. Plus, as each ACE instruction performs more work than its equivalent AVX10 loop, there's less CPU instruction overhead and potentially better RAM bandwidth usage right off the bat.
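One way to see where a figure like 16x could come from is to count multiply-adds. The sketch below assumes (this is not stated in the article) that the matrix unit computes an outer product of two 512-bit vectors, while a plain vector instruction computes an element-wise fused multiply-add. All function names are hypothetical and only illustrate the arithmetic; they do not describe the actual ACE instructions.

```python
VECTOR_BITS = 512  # register width cited in the article


def lanes(element_bits):
    """Number of elements that fit in one 512-bit vector."""
    return VECTOR_BITS // element_bits


def elementwise_fma_ops(element_bits):
    """Element-wise a*b+c: one multiply-add per lane."""
    return lanes(element_bits)


def outer_product_ops(element_bits):
    """Assumed matrix op: every lane of a meets every lane of b."""
    n = lanes(element_bits)
    return n * n


def outer_product_accumulate(a, b, acc):
    """Reference loop for acc[i][j] += a[i] * b[j] (the assumed operation)."""
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            acc[i][j] += ai * bj
    return acc


if __name__ == "__main__":
    bits = 32  # 32-bit elements: 16 lanes per vector
    ratio = outer_product_ops(bits) // elementwise_fma_ops(bits)
    print(lanes(bits), elementwise_fma_ops(bits), outer_product_ops(bits), ratio)
```

Under this assumption, two vectors of sixteen 32-bit elements yield 256 multiply-adds instead of 16, which matches the 16x figure for that element width. Other element widths would give different ratios, and real throughput still depends on how much silicon an implementation dedicates to the unit.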
The benefits go far beyond just using fewer instructions for the same thing. ACE is intended to be implementation-agnostic, meaning that ML frameworks and their underlying libraries (PyTorch, TensorFlow) can just write one code path instead of having multiple variations depending on the underlying hardware and its degree of AVX support.
ACE natively supports nearly every data type used in ML operations (including but not limited to INT8, INT32, FP8, FP16, FP32, BF16), and it can also use Open Compute Project's MX block-scaled formats natively, something that AVX10 does not provide. Developers will also be able to move some NPU-specific workloads back to the CPU when they need something done now and fast. In those situations, not having to deal with the fact that each NPU is different is a huge boon, too, as ACE offers a consistent target across x86 hardware.
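For readers unfamiliar with block-scaled formats, the idea is that a small block of values shares a single scale factor while each element is stored in a narrow type. The sketch below is a simplified illustration, assuming a block of 32 elements with one shared power-of-two scale and signed 8-bit integer elements. It is not the exact MX bit encoding, and the function names are hypothetical.

```python
import math

BLOCK_SIZE = 32  # assumed block size: one shared scale per 32 elements


def quantize_block(values, elem_max=127):
    """Return (shared_exponent, int_elements) for one block of floats."""
    amax = max(abs(v) for v in values)
    if amax == 0:
        return 0, [0] * len(values)
    # Smallest power-of-two exponent that keeps every element within range.
    exp = math.ceil(math.log2(amax / elem_max))
    scale = 2.0 ** exp
    # Round to the nearest integer and clamp to the element range.
    q = [max(-elem_max, min(elem_max, round(v / scale))) for v in values]
    return exp, q


def dequantize_block(exp, q):
    """Rebuild approximate floats from the shared exponent and elements."""
    scale = 2.0 ** exp
    return [x * scale for x in q]


if __name__ == "__main__":
    block = [i / 31 for i in range(BLOCK_SIZE)]  # sample values in [0, 1]
    exp, q = quantize_block(block)
    restored = dequantize_block(exp, q)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(exp, worst)
```

The design point is that the scale is stored once per block rather than once per element, so storage stays close to the narrow element type while the block as a whole can still cover a wide dynamic range.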