dragontamer | 15 hours ago
> It takes like 2 instrs to compute the mask from a length (AVX-512 doesn't have any instruction for this so you need to do a bzhi in GPR and move that to a mask register) so you still would likely want to avoid it in the hot loop.

Keep a register with the values IdxAdjustment = [0, 1, 2, 3, 4, 5, 6, 7].

ExecutionMask = (Broadcast(CurIdx) + IdxAdjustment) < Length

Keep looping while any lane index is < Length, which is as simple as "while(exec_mask != 0)".

I'm not seeing this take up any "extra" instructions at all. You needed the while() loop after all. It costs +1 vector register (IdxAdjustment) and a kmask by my count.

> And this doesn't help pre-AVX-512, and AVX-512 isn't particularly widespread

AVX512 is over 10 years old now. And the premier SIMD instruction set is CUDA / NVidia, not AVX512.

AVX512 is now available on all AMD CPUs and has been for the last two generations. It is also available on a select number of Intel CPUs. There are also the AMD RDNA and Intel Xe ISAs that could be targeted.

> instrs that do exist are rather slow on AMD (e.g. unconditional 12 cycles/instr throughput for masked-storing 8 32-bit elements);

Okay, I can see that possibly being an issue then.

EDIT: The AMD Zen5 optimization manual states latency 1 and throughput 2 per clock tick, while Intel's Skylake documentation at https://www.intel.com/content/www/us/en/docs/intrinsics-guid... states latency 5, throughput 1 per clock tick.

AMD Zen5 also seems to include support for vmovdqu8 (it's in the optimization guide's .xlsx sheet of latencies/throughputs, listed as latency 1 / throughput 4). I'm not sure if the "mask" register changes the instruction's numbers. I'll do some research to see if I can verify your claim (I don't have my Zen5 computer built yet... but it's soon).
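A minimal sketch of the exec-mask loop described above, assuming AVX-512VL, 8 x 32-bit lanes, and an illustrative "add 1.0f" body (the function name and the loop body are assumptions for illustration, not anything from the thread):

    #include <immintrin.h>
    #include <stdint.h>

    /* Loop shape: ExecutionMask = (Broadcast(CurIdx) + IdxAdjustment) < Length,
       keep looping while the mask is non-zero. */
    void add_one_masked(float *dst, const float *src, int32_t len) {
        const __m256i idx_adjust = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        const __m256i vlen = _mm256_set1_epi32(len);
        int32_t i = 0;
        __mmask8 m = _mm256_cmplt_epi32_mask(
            _mm256_add_epi32(_mm256_set1_epi32(i), idx_adjust), vlen);
        while (m != 0) {                          /* while (exec_mask != 0) */
            __m256 v = _mm256_maskz_loadu_ps(m, src + i);
            _mm256_mask_storeu_ps(dst + i, m,
                                  _mm256_add_ps(v, _mm256_set1_ps(1.0f)));
            i += 8;
            /* Recompute the mask: broadcast the new index, add [0..7], compare. */
            m = _mm256_cmplt_epi32_mask(
                _mm256_add_epi32(_mm256_set1_epi32(i), idx_adjust), vlen);
        }
    }

As the comment says, this costs one extra vector register (idx_adjust) plus a kmask; the per-iteration add + compare pair is what the reply below is counting against it.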
dzaima | 14 hours ago
That's two instrs - bumping the indices, and doing the comparison. You still want scalar pointer/index bumping for contiguous loads/stores (using gathers/scatters for those would be stupid and slow), and that gets you the end check for free via fused cmp+jcc. And those two instrs are vector instrs, i.e. competing for execution units with the actual thing you want to compute, whereas scalar instrs have at least some independent units, which helps avoid wanting infinite unrolling. And if your loop is processing 32-bit (or, worse, smaller) elements, those indices, if done as 64-bit (as most code will do), will cost even more.

AVX-512 might be 10 years old, but Intel's latest (!) cores still don't support it on hardware with E-cores, so we're still a decade away from being able to just assume it exists. Another thread on this post mentioned that Intel has shipped hardware without AVX/AVX2/FMA as late as 2021, even.

> Okay, I can see that possibly being an issue then.

To be clear, that's only the AVX2 instrs; AVX-512 masked loads/stores are fast (yes, even on Zen 4, where the AVX-512 masked loads/stores are fast, the AVX2 ones that do an equivalent amount of work, albeit taking the mask in a different register class, are slow). uops.info: https://uops.info/table.html?search=maskmovd%20m256&cb_lat=o...

Intel also has AVX-512 masked 512-bit 8-bit-elt stores at half the throughput of unmasked for some reason (not 256-bit or ≥16-bit-elt though; the culprit presumably being the mask having 64 elts): https://uops.info/table.html?search=movdqu8%20m512&cb_lat=on...

And masked loads use some execution ports on both Intel and AMD, eating away at the throughput of the main operation. All in all, the hardware just isn't built for needlessly using masked loads/stores in hot loops.
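For contrast, a sketch of the shape this reply argues for: scalar index bumping in the hot loop (the end check is a plain fused cmp+jcc, no vector instructions spent on mask upkeep), with masking confined to a single tail iteration. Same caveats as above: the function name and the "add 1.0f" body are illustrative assumptions.

    #include <immintrin.h>
    #include <stdint.h>

    void add_one_tail_masked(float *dst, const float *src, int64_t len) {
        const __m256 ones = _mm256_set1_ps(1.0f);
        int64_t i = 0;
        /* Hot loop: unmasked vectors; loop control is a scalar compare+branch. */
        for (; i + 8 <= len; i += 8) {
            _mm256_storeu_ps(dst + i,
                             _mm256_add_ps(_mm256_loadu_ps(src + i), ones));
        }
        /* Tail: build the mask once from the remaining length (the bzhi-style
           step mentioned upthread) and do one masked iteration (AVX-512VL). */
        if (i < len) {
            __mmask8 m = (__mmask8)((1u << (unsigned)(len - i)) - 1);
            __m256 v = _mm256_maskz_loadu_ps(m, src + i);
            _mm256_mask_storeu_ps(dst + i, m, _mm256_add_ps(v, ones));
        }
    }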