timewizard 21 hours ago

> Another problem is that each new SIMD generation requires new instruction opcodes and encodings.

It requires new opcodes. It does not strictly require new encodings. Several of the newer encodings are legacy compatible and can encode previous generations' vector instructions.

> so the architecture must provide enough SIMD registers to avoid register spilling.

Or the architecture allows memory operands. The great joy of basic x86 encoding is that you don't actually need to put things in registers to operate on them.

> Usually you also need extra control logic before the loop. For instance if the array length is less than the SIMD register width, the main SIMD loop should be skipped.

What do you want? No control overhead, or the speed enabled by SIMD? This isn't a flaw. It's the necessary price for the efficiency you get in the main loop.
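For readers who want the control structure spelled out, here is a minimal sketch of the pattern under discussion. It is plain Python standing in for real SIMD code: `WIDTH` and `add_arrays` are hypothetical names, and each main-loop iteration only models one full-width vector add.

```python
# Illustrative only: WIDTH stands in for the number of lanes in one SIMD register.
WIDTH = 8


def add_arrays(a, b):
    """Element-wise add, structured as a full-width main loop plus a scalar tail."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    out = [0.0] * n
    main_end = n - n % WIDTH  # largest multiple of WIDTH that fits in n

    # Main "vector" loop: runs zero times when n < WIDTH, so it is skipped
    # for short arrays without a separate branch.
    for i in range(0, main_end, WIDTH):
        # One iteration models a single full-width SIMD add.
        out[i:i + WIDTH] = [x + y for x, y in zip(a[i:i + WIDTH], b[i:i + WIDTH])]

    # Scalar tail: the remaining n % WIDTH elements.
    for i in range(main_end, n):
        out[i] = a[i] + b[i]
    return out
```

The extra logic is the `main_end` computation and the tail loop. Both run once per call, while the main loop does the bulk of the work on long arrays.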

camel-cdr 21 hours ago

> The great joy of basic x86 encoding is that you don't actually need to put things in registers to operate on them.

That's just spilling with fewer steps. The executed uops should be the same.

timewizard 20 hours ago

> That's just spilling with fewer steps.

Another way to say this is it's "more efficient."

> The executed uops should be the same.

And "more densely coded."

camel-cdr 19 hours ago

Hm, I was wondering how the density compares, given that x86 has more complex encodings in general.

vaddps zmm1,zmm0,ZMMWORD PTR [r14]

takes six bytes to encode:

62 d1 7c 48 58 0e

In SVE and RVV, a separate load plus add takes 8 bytes to encode (two 4-byte instructions).
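To check the six bytes by hand, here is a small sketch that unpacks the EVEX prefix fields. The `decode_evex` helper is hypothetical and deliberately narrow: it assumes a 6-byte instruction with a plain register-indirect memory operand (no SIB byte, no displacement), and the field layout follows my reading of the EVEX format, where the register-extension bits are stored inverted.

```python
def decode_evex(code: bytes) -> dict:
    """Decode one EVEX instruction of the form: 62 P0 P1 P2 opcode modrm."""
    if len(code) != 6 or code[0] != 0x62:
        raise ValueError("expected a 6-byte instruction starting with 0x62")
    p0, p1, p2, opcode, modrm = code[1:6]

    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):
        raise ValueError("only plain [base] memory operands are handled here")

    # Extension bits are stored inverted in the prefix.
    r_hi = (~p0 >> 7) & 1   # EVEX.R
    b_hi = (~p0 >> 5) & 1   # EVEX.B
    r_hi2 = (~p0 >> 4) & 1  # EVEX.R'
    v_hi = (~p2 >> 3) & 1   # EVEX.V'

    return {
        "length": len(code),
        "opcode_map": p0 & 0b11,                        # 1 selects the 0F map
        "opcode": opcode,                               # 0x58 is ADDPS in that map
        "dest": (r_hi2 << 4) | (r_hi << 3) | reg,       # destination vector register
        "src1": (v_hi << 4) | ((~p1 >> 3) & 0xF),       # first source (vvvv, inverted)
        "base": (b_hi << 3) | rm,                       # base GPR of the memory operand
        "vector_bits": 128 << ((p2 >> 5) & 0b11),       # L'L field: 0b10 means 512
    }


fields = decode_evex(bytes.fromhex("62d17c48580e"))
print(fields)
```

For the bytes above this gives destination register 1 (zmm1), first source 0 (zmm0), base register 14 (r14), and a 512-bit vector length, matching the assembly. The fixed 4-byte EVEX prefix is most of the 6 bytes, so the comparison with two fixed-width 4-byte instructions comes out at 6 versus 8.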