So how does this differ from the SSE/AVX instructions already available on most x64 machines?

adrian_b 6 hours ago | parent | next [-]

This is not a vector extension (like Intel AVX/AVX-512 or Arm SVE), but a matrix extension (like Intel AMX or Arm SME or the "tensor" operations of NVIDIA GPUs).

Some of the latest generations of Intel server CPUs with P-cores already have the AMX matrix extension, which can be used to implement fast AI inference.

AMD has not implemented AMX yet, and probably they will not implement it, because this new "AI Compute Extension", which has been defined by Intel and AMD together, is an alternative/extension to AMX (ACE inherits some parts of AMX, but not all). It appears that the fate of Intel AMX will be the same as that of the original Apple undocumented AMX extension, which was replaced by the SME extension defined together with the Arm company (like Intel AMX will be replaced by ACE defined together with AMD).

Matrix extensions are more efficient for AI inference than vector extensions, because they reduce the ratio between memory accesses and computation operations.
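To put a rough number on that ratio, here is a toy counting model. It is a sketch under stated assumptions, not a description of any particular instruction: both function names are made up for illustration, the accumulator is assumed to stay in a register, and only source operands are counted.

```python
def operands_per_mac_vector(lanes: int) -> float:
    """Vector FMA: two source vectors of `lanes` elements feed `lanes` multiply-adds."""
    return (2 * lanes) / lanes


def operands_per_mac_outer_product(rows: int, cols: int) -> float:
    """Outer-product tile update: a column of `rows` elements and a row of
    `cols` elements feed rows * cols multiply-adds."""
    return (rows + cols) / (rows * cols)


if __name__ == "__main__":
    # Hypothetical sizes: 16 lanes versus a 16 x 16 accumulator tile.
    print(operands_per_mac_vector(16))             # 2.0 operands per multiply-add
    print(operands_per_mac_outer_product(16, 16))  # 0.125 operands per multiply-add
```

In this model the vector case needs the same number of operands per multiply-add no matter how wide the vector is, while the tile case needs fewer as the tile grows.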

However, I would like to have not only a matrix extension for AI, but also a matrix extension for all numeric formats up to FP64, like in Arm/Apple SME or in the NVIDIA and AMD "datacenter" GPUs.

24245245t2 4 hours ago | parent [-]

ACE is a proliferation of AMX -- palette 2. So AMD is actually going to implement the basics of AMX -- whether they also do palette 1 is up to them.

anematode 7 hours ago | parent | prev | next [-]

One thing that stuck out to me is that it deals with a lot more data formats, in particular low-precision formats like FP4, FP6 and FP8. Manipulating those formats can take a lot of annoying effort; in general, x86 (until AVX-512, at least) has unconvincing support for so-called "lane-crossing" instructions that move data across 16-byte boundaries within a vector. So you can imagine unpacking, e.g., tightly packed 7-bit data to 8-bit data is a real slog.
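For reference, this is roughly what the 7-bit to 8-bit unpack has to compute, written as a scalar Python sketch. The helper names `pack_7bit` and `unpack_7bit` and the little-endian bit order are assumptions made for illustration; a real kernel would do this with shuffles and shifts instead of a loop.

```python
def pack_7bit(values):
    """Pack integers in [0, 127] into a little-endian bitstream, 7 bits each."""
    bits = 0
    nbits = 0
    out = bytearray()
    for v in values:
        if not 0 <= v <= 127:
            raise ValueError("value does not fit in 7 bits")
        bits |= v << nbits
        nbits += 7
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)  # flush the final partial byte
    return bytes(out)


def unpack_7bit(packed, count):
    """Expand `count` 7-bit fields into one byte each."""
    out = bytearray()
    for i in range(count):
        byte, shift = divmod(7 * i, 8)
        # A field can straddle two source bytes, so read a 16-bit window.
        high = packed[byte + 1] << 8 if byte + 1 < len(packed) else 0
        out.append(((packed[byte] | high) >> shift) & 0x7F)
    return bytes(out)


if __name__ == "__main__":
    values = list(range(0, 128, 5))
    assert list(unpack_7bit(pack_7bit(values), len(values))) == values
    # Output byte 16 (start of the second 16-byte lane) reads source byte 14.
    print(divmod(7 * 16, 8))
```

The last line shows where the lane problem comes from: the second 16-byte lane of the output starts reading at source byte 14, which sits in the first lane of the input.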

I can already immediately think of a use case for vunpackb in some of the stuff I'm working on, where we'd like to efficiently unpack weights from the high half of a vector.

Separately, adding all signed–unsigned variants of the VNNI dot product instructions is a welcome (albeit niche) change. There was an annoying divergence here between major ISAs: x86 added vpdpbusd, which computes a dot product between u8 and i8, while ARM added vdotq, which computes a dot product either between u8 and u8 elements, or i8 and i8. So for broad compatibility, you generally had to restrict one of your inputs to [0,127]. This difference shows in the design of (for example) WASM relaxed SIMD, where the result of i32x4.relaxed_dot_i8x16_i7x16_add_s is implementation-defined if you exceed the [0,127] range. ARM later added mixed-sign variants, and now x86 consummates it.
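A scalar sketch of why the [0,127] restriction works. The function names below are hypothetical and only model the arithmetic of one 4-byte group; accumulator width, wraparound and saturation are ignored.

```python
def as_i8(byte: int) -> int:
    """Reinterpret a raw byte (0..255) as a signed 8-bit integer."""
    return byte - 256 if byte >= 128 else byte


def dot4_u8_i8(acc: int, a: bytes, b: bytes) -> int:
    """Mixed-sign flavour: bytes of `a` are unsigned, bytes of `b` are signed."""
    return acc + sum(x * as_i8(y) for x, y in zip(a, b))


def dot4_i8_i8(acc: int, a: bytes, b: bytes) -> int:
    """Same-sign flavour: both inputs are treated as signed bytes."""
    return acc + sum(as_i8(x) * as_i8(y) for x, y in zip(a, b))


if __name__ == "__main__":
    b = bytes([0xFF, 0x80, 5, 200])   # read as i8: -1, -128, 5, -56

    # With `a` restricted to [0, 127], both flavours agree.
    a_safe = bytes([1, 127, 0, 64])
    assert dot4_u8_i8(0, a_safe, b) == dot4_i8_i8(0, a_safe, b)

    # Once `a` has a byte above 127, the two interpretations diverge.
    a_wide = bytes([200, 0, 0, 0])
    assert dot4_u8_i8(0, a_wide, b) != dot4_i8_i8(0, a_wide, b)
```

A byte in [0,127] has the same value whether it is read as u8 or i8, so code written that way gives the same answer on either flavour.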

dmitrygr 7 hours ago | parent | prev [-]

this also adds new registers to operate on (more state) - 1KB more state at least (512b x 16)

bonzini 6 hours ago | parent [-]

It reuses AMX registers, so I think the only new state is the block scale register (1024 bits)?
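For scale, the sizes being compared work out as follows. This is just unit arithmetic: the first two figures come from this thread, the tile figures are the existing Intel AMX layout (8 tiles of up to 16 rows x 64 bytes), and none of it is taken from the ACE specification itself.

```python
BITS_PER_BYTE = 8


def bits_to_bytes(bits: int) -> int:
    """Convert a size in bits to bytes."""
    return bits // BITS_PER_BYTE


# "512b x 16" from the parent comment, read as sixteen 512-bit registers.
parent_estimate = bits_to_bytes(512 * 16)

# Block scale register, 1024 bits as stated above.
block_scale_register = bits_to_bytes(1024)

# Existing AMX tile state: each tile is up to 16 rows of 64 bytes, 8 tiles total.
amx_tile = 16 * 64
amx_tile_file = 8 * amx_tile

if __name__ == "__main__":
    print(parent_estimate)       # 1024 bytes
    print(block_scale_register)  # 128 bytes
    print(amx_tile)              # 1024 bytes
    print(amx_tile_file)         # 8192 bytes
```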

adrian_b 6 hours ago | parent [-]

The fraction of the installed base of x86 CPUs that support AMX is very small (i.e. only a part of the recent Intel server CPUs support AMX, while the other Intel server CPUs, all Intel consumer CPUs and all AMD CPUs do not support AMX).

CPUs with ACE will in most cases replace CPUs that did not support AMX, so on those machines all the registers specified by ACE, but not by AVX10 (a.k.a. AVX-512), are new.

24245245t2 3 hours ago | parent [-]

From the ISA perspective, bonzini is right: only BSR0/SCALADATA is new.