dzaima | 20 hours ago
Tail handling is not significant for loops with tons of iterations, but there are plenty of real-world situations where a loop runs for only 5 or so iterations. Even at around 100 iterations, a loop processing 8 elements at a time (i.e. 256-bit vectors, 32-bit elements) does 12 vectorized iterations plus up to 7 scalar ones, which is still quite significant. At 1000 iterations the scalar tail could still be a couple percent, and it still doubles the L1/uop-cache space the loop takes. It's absolutely a significant contributor to code size (in scenarios where vectorized code in general is a significant contributor to code size, which admittedly is only very specialized software).
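To make the arithmetic above concrete, here is a minimal sketch in C. The names `loop_split` and `split_loop` are hypothetical helpers made up for illustration; the sketch only counts iterations and says nothing about how long each one takes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: how n scalar iterations divide into
   full vector iterations plus a leftover scalar tail. */
typedef struct {
    size_t vector_iters; /* iterations that process `lanes` elements each */
    size_t tail_iters;   /* leftover elements handled one at a time */
} loop_split;

/* lanes = vector width in bits / element width in bits,
   e.g. 256 / 32 = 8 for 256-bit vectors of 32-bit elements. */
static loop_split split_loop(size_t n, size_t lanes) {
    assert(lanes > 0);
    loop_split s;
    s.vector_iters = n / lanes;
    s.tail_iters = n % lanes; /* always in the range 0 .. lanes - 1 */
    return s;
}
```

With 8 lanes, `split_loop(103, 8)` gives 12 vector iterations and 7 scalar ones, so more than a third of the loop trips are in the tail. `split_loop(5, 8)` gives 0 vector iterations and 5 scalar ones, so the vectorized body never runs at all.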
dragontamer | 16 hours ago | parent
Note that AVX512 has per-lane execution masks, so I'm not fully convinced that tail handling should even be a thing anymore. If (my lane is beyond the buffer) then (exec flag off, do not store my lane). In practice that should be a simple vcompress instruction (AVX512 register) and maybe a move afterwards? I admit that I'm not an AVX512 expert, but it doesn't seem all that difficult with vcompress instructions + an execmask.
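A rough sketch of the masked-tail idea in C with AVX-512 intrinsics. Note the assumptions: the kernel `add_one` (dst[i] = src[i] + 1) is a made-up example, and the sketch uses masked load/store intrinsics rather than vcompress, which is my substitution and not what the comment above proposes. As far as I know, lanes whose mask bit is off are neither read nor written and do not fault, which is what makes this safe at the end of a buffer. The `#if` guard falls back to a plain scalar loop when the compiler is not targeting AVX-512F.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical example kernel: dst[i] = src[i] + 1 for i in [0, n). */
static void add_one(uint32_t *dst, const uint32_t *src, size_t n) {
    assert(dst != NULL && src != NULL);
    size_t i = 0;
#if defined(__AVX512F__)
    const __m512i one = _mm512_set1_epi32(1);
    /* Full iterations: 16 lanes of 32 bits each. */
    for (; i + 16 <= n; i += 16) {
        __m512i v = _mm512_loadu_si512((const void *)(src + i));
        _mm512_storeu_si512((void *)(dst + i), _mm512_add_epi32(v, one));
    }
    /* Tail: one more vector iteration with only the low (n - i) lanes
       enabled, so there is no separate scalar loop body. */
    if (i < n) {
        __mmask16 k = (__mmask16)((1u << (n - i)) - 1u); /* n - i is 1..15 */
        __m512i v = _mm512_maskz_loadu_epi32(k, src + i);
        _mm512_mask_storeu_epi32(dst + i, k, _mm512_add_epi32(v, one));
        i = n;
    }
#endif
    /* Scalar path: does all the work without AVX-512F, nothing otherwise. */
    for (; i < n; ++i) {
        dst[i] = src[i] + 1u;
    }
}
```

Whether the masked iteration is actually cheaper than a short scalar loop depends on the target, so treat this as an illustration of the code-size argument rather than a performance claim.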