dragontamer 16 hours ago
Note that AVX-512 has per-lane execution masks, so I'm not fully convinced that tail handling should even be a thing anymore. If (my lane is beyond the buffer) then (exec flag off, do not store my lane). In practice that should be a simple vcompress instruction (on an AVX-512 register) and maybe a move afterwards? I admit I'm not an AVX-512 expert, but it doesn't seem all that difficult with vcompress instructions plus the exec mask.
dzaima 15 hours ago | parent
It takes about 4 instructions to compute the mask from an arbitrary length (AVX-512 doesn't have an instruction for this, so you need to do `bzhi(-1, min(left,vl))` and move the result to a mask register), so you would likely still want to keep it out of the hot loop. Doing the tail separately but with masked SIMD is a perf improvement over a scalar loop (perhaps outside of the case of 1 or 2 elements, which is a realistic situation for a bunch of loops too), but it'll still add a double-digit percentage to code size over a plain SIMD loop without tail handling.

And this doesn't help pre-AVX-512, and AVX-512 isn't particularly widespread. AVX2 does have masked load/store with 32-/64-bit granularity, but not 8-/16-bit, and the instructions that do exist are rather slow on AMD (e.g. unconditional 12 cycles/instr throughput for masked-storing 8 32-bit elements). SSE has none, and neither does ARM NEON; ARM SVE isn't widespread either (incl. not being supported on Apple silicon).

(You don't need vcompress: plain masked load/store instructions exist in AVX-512 and are sufficient.)
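For reference, a rough sketch in C of the "separate masked tail" approach described above. The function names (`tail_mask16`, `add_one_avx512`, `add_one`) and the add-one loop body are made up for illustration. It assumes GCC or Clang on x86-64 (for the `target` attribute and `__builtin_cpu_supports`), and it falls back to a scalar loop when AVX-512F or BMI2 is missing.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Mask with the low min(left, 16) bits set, one bit per 32-bit lane.
 * The clamp matters: bzhi only looks at the low 8 bits of its index. */
__attribute__((target("bmi2,avx512f")))
static __mmask16 tail_mask16(size_t left) {
    size_t n = left < 16 ? left : 16;
    return (__mmask16)_bzhi_u32(0xFFFFFFFFu, (unsigned)n);
}

/* Hypothetical kernel: add 1 to every element of buf. */
__attribute__((target("bmi2,avx512f")))
static void add_one_avx512(int32_t *buf, size_t len) {
    const __m512i one = _mm512_set1_epi32(1);
    size_t i = 0;

    /* Hot loop: full 16-lane vectors, no mask computation. */
    for (; i + 16 <= len; i += 16) {
        __m512i v = _mm512_loadu_si512(buf + i);
        _mm512_storeu_si512(buf + i, _mm512_add_epi32(v, one));
    }

    /* Tail: one masked load/store; lanes past the end are neither
     * read nor written, so no vcompress is involved. */
    if (i < len) {
        __mmask16 m = tail_mask16(len - i);
        __m512i v = _mm512_maskz_loadu_epi32(m, buf + i);
        _mm512_mask_storeu_epi32(buf + i, m, _mm512_add_epi32(v, one));
    }
}

/* Runtime dispatch so the sketch also runs on CPUs without AVX-512. */
static void add_one(int32_t *buf, size_t len) {
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("bmi2")) {
        add_one_avx512(buf, len);
    } else {
        for (size_t i = 0; i < len; i++) buf[i] += 1;
    }
}
```

The mask is computed once, outside the main loop, which is the point about keeping those extra instructions off the hot path. The masked tail is still a second copy of the loop body, which is where the code-size cost comes from.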