dzaima 15 hours ago
It takes something like 4 instructions to compute the mask from an arbitrary length (AVX-512 doesn't have a dedicated instruction for this, so you need to do `bzhi(-1, min(left, vl))` and move the result into a mask register), so you'd still likely want to keep it out of the hot loop. Doing the tail separately but with masked SIMD is a perf improvement over a scalar tail loop (perhaps outside the case of only 1 or 2 remaining elements, which is a realistic situation for plenty of loops too), but it still adds a double-digit percentage to code size compared to a plain SIMD loop with no tail handling at all.

And this doesn't help pre-AVX-512, and AVX-512 isn't particularly widespread. AVX2 does have masked load/store at 32-/64-bit granularity, but not 8-/16-bit, and the instructions that do exist are rather slow on AMD (e.g. an unconditional 12 cycles/instruction throughput for masked-storing 8 32-bit elements); SSE has none, and ARM NEON has none either (and ARM SVE isn't widespread either, including not being supported on Apple silicon).

(You don't need vcompress; plain masked load/store instructions do exist in AVX-512 and are sufficient.)
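For concreteness, a minimal sketch of that shape of code with AVX-512 intrinsics — the int32-add body and all names here are made up for illustration, not anything from the thread:

```c
// Sketch (assumed example): plain SIMD main loop plus a separate masked tail,
// with the tail mask built via bzhi in a GPR and moved into a k-register.
#include <immintrin.h>
#include <stddef.h>

void add_i32(int *dst, const int *a, const int *b, size_t n) {
    size_t i = 0;
    // Hot loop: full 16-element blocks, no masking.
    for (; i + 16 <= n; i += 16) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        _mm512_storeu_si512(dst + i, _mm512_add_epi32(va, vb));
    }
    // Tail: mask = bzhi(-1, remaining), then masked load/store.
    if (i < n) {
        __mmask16 k = _cvtu32_mask16(_bzhi_u32(~0u, (unsigned)(n - i)));
        __m512i va = _mm512_maskz_loadu_epi32(k, a + i);
        __m512i vb = _mm512_maskz_loadu_epi32(k, b + i);
        _mm512_mask_storeu_epi32(dst + i, k, _mm512_add_epi32(va, vb));
    }
}
```

The `if (i < n)` block is exactly the code-size cost being discussed: the bzhi, the mask move, and a masked copy of the loop body.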
dragontamer 15 hours ago | parent
> It takes like 2 instrs to compute the mask from a length (AVX-512 doesn't have any instruction for this so you need to do a bzhi in GPR and move that to a mask register) so you still would likely want to avoid it in the hot loop.

Keep a register with the values `IdxAdjustment = [0, 1, 2, 3, 4, 5, 6, 7]`, then each iteration compute `ExecutionMask = (Broadcast(CurIdx) + IdxAdjustment) < Length`. Keep looping while any lane index is still below `Length`, which is as simple as `while(exec_mask != 0)` (see the sketch after this comment).

I'm not seeing this take up any "extra" instructions at all: you needed the while() loop anyway. By my count it costs +1 vector register (IdxAdjustment) and one kMask register.

> And this doesn't help pre-AVX-512, and AVX-512 isn't particularly widespread

AVX-512 is over 10 years old now, and the premier SIMD execution instruction set is CUDA / NVidia, not AVX-512. AVX-512 is now available on all AMD CPUs and has been for the last two generations. It is also available on a select number of Intel CPUs. There are also the AMD RDNA and Intel Xe ISAs that could be targeted.

> instrs that do exist are rather slow on AMD (e.g. unconditional 12 cycles/instr throughput for masked-storing 8 32-bit elements);

Okay, I can see that possibly being an issue then.

EDIT: the AMD Zen5 Optimization Manual states latency 1 and throughput 2 per clock tick, while Intel's Skylake documentation at https://www.intel.com/content/www/us/en/docs/intrinsics-guid... states latency 5 and throughput 1 per clock tick. Zen5 also seems to support vmovdqu8 (it's in the optimization guide's .xlsx sheet of latencies/throughputs, listed as 1 latency / 4 throughput), though I'm not sure whether adding the mask register changes the instruction's timing. I'll do some research to see if I can verify your claim (I don't have my Zen5 computer built yet... but it's soon).
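A minimal sketch of that always-masked loop with AVX-512 intrinsics, as referenced above — the int32-add body and names (`idx_adjust`, `exec`) are assumptions of mine, and it assumes `n` fits in 32 bits:

```c
// Sketch (assumed example): derive the execution mask every iteration from
// (Broadcast(CurIdx) + IdxAdjustment) < Length, loop while the mask is non-zero.
#include <immintrin.h>
#include <stddef.h>

void add_i32_masked(int *dst, const int *a, const int *b, size_t n) {
    const __m512i idx_adjust = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                                 8, 9, 10, 11, 12, 13, 14, 15);
    const __m512i vlen = _mm512_set1_epi32((int)n);  // assumes n fits in int32
    size_t i = 0;
    // ExecutionMask = (Broadcast(CurIdx) + IdxAdjustment) < Length
    __mmask16 exec = _mm512_cmplt_epu32_mask(
        _mm512_add_epi32(_mm512_set1_epi32((int)i), idx_adjust), vlen);
    while (exec != 0) {
        __m512i va = _mm512_maskz_loadu_epi32(exec, a + i);
        __m512i vb = _mm512_maskz_loadu_epi32(exec, b + i);
        _mm512_mask_storeu_epi32(dst + i, exec, _mm512_add_epi32(va, vb));
        i += 16;
        exec = _mm512_cmplt_epu32_mask(
            _mm512_add_epi32(_mm512_set1_epi32((int)i), idx_adjust), vlen);
    }
}
```

Here every load/store in the body is masked, so there is no separate tail block; the trade-off is the per-iteration compare and the masked memory ops in the hot loop.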