bdash 6 hours ago

SSVE instructions are executed by the SME engine, which trades latency for throughput. SSVE is really intended to support use of SME, rather than as a replacement for Advanced SIMD on the CPU core itself.
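For context, SSVE is just SVE executed in streaming mode, which you enter and leave explicitly. A minimal sketch (register choices and the predicate setup are arbitrary, just to illustrate the mode switch):

```
// Enter streaming mode; SVE instructions issued after this point are
// SSVE and run on the SME engine, not the core's Advanced SIMD units.
smstart sm
ptrue   p0.s                       // all-lanes-active predicate
ld1w    { z0.s }, p0/z, [x0]       // streaming-mode vector loads
ld1w    { z1.s }, p0/z, [x1]
fadd    z0.s, p0/m, z0.s, z1.s     // element-wise add at streaming vector length
st1w    { z0.s }, p0, [x2]
smstop  sm                         // leave streaming mode
```

Note the round trip through `smstart`/`smstop` itself has a cost, which is part of why sprinkling SSVE into otherwise core-resident code tends not to pay off.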

The Apple Silicon CPU Optimization Guide has a lot of great information on SME and SSVE, along with more general information on optimizing for Apple's CPUs.

A few quotes from Apple's guide that are particularly relevant to SSVE, from "SSVE Vector Execution Unit Optimization":

> Broadly, this unit is designed to support long vector and matrix operations performed on ZA storage _in the SME Processing Grid_.

> Recommendation: Use SSVE in a supporting role to enable high throughput SME grid computation.

> [Magnitude: High | Applicability: High] SSVE offers wide 64B vectors. While the ISA includes instructions that can operate on multi-vectors, the throughput is often only one 64B vector per cycle. Use SSVE to enable SME, which offers higher parallelism.

> Because of non-speculative execution, communication latencies, and in some cases long memory and computation latencies, SME engine instructions trail execution in the core by dozens to thousands of cycles. Any core compute instructions that consume data produced by the SME engine may have to wait an indeterminate (but long) amount of time for the data to arrive.

tom_ 6 hours ago | parent | next [-]

I was struck by the "Magnitude: High | Applicability: High" bit. Who writes like this? More importantly, who reads like this? The V4 doc (which I have yet to read, but I did a text search) has 64 occurrences of this sort of phrasing; not actually all that many, given that there's 293 pages, but enough to be interesting. I wonder if this extra stuff is there to make LLMs pay particular attention.

bdash 6 hours ago | parent [-]

Intel's software optimization guides have similar annotations on many of their guidelines, and have done so since long before LLMs were a thing. As a reader, it's useful to know how impactful a given recommendation is and how generally applicable it is without having to read the more detailed explanations.

tom_ 5 hours ago | parent [-]

Ahh, interesting, thanks. (I read the reference manuals but typically ignore the rest... I don't need to write this stuff, just read it!) I've seen people recommend creating docs to be LLM-friendly and I was wondering if this was an instance of that.

anematode 6 hours ago | parent | prev [-]

That makes a ton of sense and aligns with my observations. Thanks for the resource :)

Since SSVE is slow, I was hoping that SME instructions could be used in a vector-like fashion (e.g. adding two matrices with high throughput, or a Hadamard/element-wise product), but it seems most matrix accelerator ISAs don't support that.

bdash 5 hours ago | parent | next [-]

There are SME / SME2 instructions that use the ZA tiles as vector registers / vector groups. These can take advantage of the higher throughput of the SME processing grid vs SSVE instructions that operate on Z registers. See the `FMLA (SME2)` case under Peak Performance at https://scalable.uni-jena.de/opt/sme/micro.html#peak-perform....
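Roughly, the vector-group form looks like this (a sketch; the select register, offset, and Z registers are arbitrary, and real code would loop and initialize ZA first):

```
// SME2 multi-vector FMLA accumulating into ZA array vectors:
// za[w8+0..3] += { z0..z3 } * z4, as a group of four vectors per
// instruction, using the SME grid's throughput rather than the
// one-64B-vector-per-cycle SSVE path.
smstart                                       // enable streaming mode and ZA
mov     w8, #0                                // vector-group select index
fmla    za.s[w8, 0, vgx4], { z0.s - z3.s }, z4.s
smstop
```

So you don't need the full outer-product (`fmopa`) shape to benefit from the grid; element-wise accumulation into ZA vector groups gets you there too.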
