timewizard 2 months ago

> Another problem is that each new SIMD generation requires new instruction opcodes and encodings.

It requires new opcodes. It does not strictly require new encodings. Several of the newer encodings are legacy-compatible and can also encode previous generations' vector instructions.

> so the architecture must provide enough SIMD registers to avoid register spilling.

Or the architecture allows memory operands. The great joy of basic x86 encoding is that you don't actually need to put things in registers to operate on them.

> Usually you also need extra control logic before the loop. For instance if the array length is less than the SIMD register width, the main SIMD loop should be skipped.

What do you want? No control overhead or the speed enabled by SIMD? This isn't a flaw. This is a necessary price to achieve the efficiency you do in the main loop.
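To make the control logic under discussion concrete, here is a minimal sketch of the usual loop shape. It is plain Python standing in for vector code: `VLEN` and `add_arrays` are made-up names for illustration, one slice assignment models one vector load/add/store, and both inputs are assumed to have the same length.

```python
VLEN = 8  # assumed lane count, e.g. 8 floats in a 256-bit register


def add_arrays(a, b):
    """Element-wise add, structured like a SIMD main loop plus scalar tail."""
    n = len(a)
    out = [0.0] * n
    i = 0
    # Main loop: runs only while a full vector's worth of elements remains,
    # so it is skipped entirely when n < VLEN.
    while i + VLEN <= n:
        # One slice assignment models a single vector load/add/store.
        out[i:i + VLEN] = [x + y for x, y in zip(a[i:i + VLEN], b[i:i + VLEN])]
        i += VLEN
    # Scalar tail: handles the remaining n % VLEN elements one at a time.
    while i < n:
        out[i] = a[i] + b[i]
        i += 1
    return out
```

The extra cost is the two loop-bound checks and the tail loop; the main loop body itself carries no per-element branching.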

dzaima 2 months ago | parent | next [-]

> The great joy of basic x86 encoding is that you don't actually need to put things in registers to operate on them.

That's... 1 register saved, out of 16 (or 32 on AVX-512). Perhaps useful sometimes, but far from a particularly significant aspect spill-wise.

And doing that means you lose the ability to move the load earlier. That's perhaps not too significant on OoO hardware, but it's still a consideration: while reorder windows are multiple hundreds of instructions, the actual OoO limit is the scheduling queues, which frequently have under a hundred entries, i.e. a couple dozen cycles' worth of instructions, at which point the ≥4-cycle latency of a load is not actually insignificant. And putting the load directly in the arith op is the worst-case scenario for this.

camel-cdr 2 months ago | parent | prev [-]

> The great joy of basic x86 encoding is that you don't actually need to put things in registers to operate on them.

That's just spilling with fewer steps. The executed uops should be the same.

timewizard 2 months ago | parent [-]

> That's just spilling with fewer steps.

Another way to say this is it's "more efficient."

> The executed uops should be the same.

And "more densely coded."

camel-cdr 2 months ago | parent [-]

Hm, I was wondering how the density compares, given that x86 has more complex encodings in general.

vaddps zmm1,zmm0,ZMMWORD PTR [r14]

takes six bytes to encode:

62 d1 7c 48 58 0e

In SVE and RVV, a load+add is two 4-byte instructions, so it takes 8 bytes to encode.
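For anyone who wants to check the six bytes quoted above, here is a rough sketch that splits them into EVEX fields. It assumes the classic AVX-512 EVEX layout (0x62, three payload bytes, opcode, ModRM) and only handles this one addressing form, with no SIB byte and no displacement; `decode_evex` is a made-up helper, not part of any library.

```python
def decode_evex(code: bytes) -> dict:
    """Split a 6-byte EVEX instruction (no SIB, no displacement) into fields."""
    if len(code) != 6 or code[0] != 0x62:
        raise ValueError("expected a 6-byte instruction starting with 0x62")
    p0, p1, p2, opcode, modrm = code[1:6]

    # Register-extension bits are stored inverted in the prefix.
    r = ((p0 >> 7) & 1) ^ 1        # extends ModRM.reg (bit 3)
    b = ((p0 >> 5) & 1) ^ 1        # extends ModRM.rm  (bit 3)
    r_hi = ((p0 >> 4) & 1) ^ 1     # extends ModRM.reg (bit 4)
    opcode_map = p0 & 0b111        # 1 = 0F map

    vvvv = ((p1 >> 3) & 0xF) ^ 0xF  # first source register, inverted
    v_hi = ((p2 >> 3) & 1) ^ 1      # bit 4 of the first source, inverted
    length = {0: 128, 1: 256, 2: 512}[(p2 >> 5) & 0b11]  # L'L field
    mask = p2 & 0b111               # aaa: opmask register, 0 = no masking

    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):
        raise ValueError("only the plain [base] addressing form is handled")

    prefix = {128: "xmm", 256: "ymm", 512: "zmm"}[length]
    return {
        "map": opcode_map,
        "opcode": opcode,
        "bits": length,
        "mask": mask,
        "dest": f"{prefix}{reg | (r << 3) | (r_hi << 4)}",
        "src1": f"{prefix}{vvvv | (v_hi << 4)}",
        "base": f"r{rm | (b << 3)}",
        "size": len(code),
    }


fields = decode_evex(bytes.fromhex("62d17c48580e"))
```

With those bytes, `fields` reports opcode 0x58 in map 1, a 512-bit operation with `zmm1` as destination, `zmm0` as first source and `r14` as the memory base, which matches the assembly line above.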