dragontamer 16 hours ago

Note that AVX512 has per-lane execution masks, so I'm not fully convinced that tail handling should even be a thing anymore.

If(my lane is beyond the buffer) then (exec flag off, do not store my lane).

In practice that should be a simple vcompress instruction (AVX512 register) and maybe a move afterwards? I admit I'm not an AVX512 expert, but it doesn't seem all that difficult with vcompress instructions + execmask.

dzaima 15 hours ago | parent [-]

It takes like 4 instrs to compute the mask from an arbitrary length (AVX-512 doesn't have any instruction for this, so you need to do `bzhi(-1, min(left,vl))` and move that to a mask register), so you'd still likely want to avoid it in the hot loop.

Doing the tail separately but with masked SIMD is an improvement over a scalar loop perf-wise (perhaps outside the case of 1 or 2 elements, which is a realistic situation for plenty of loops too), but it'll still add a double-digit percentage to code size over just a plain SIMD loop without tail handling.

And this doesn't help pre-AVX-512, and AVX-512 isn't particularly widespread. (AVX2 does have masked load/store with 32-/64-bit granularity, but not 8-/16-bit, and the instrs that do exist are rather slow on AMD, e.g. an unconditional 12 cycles/instr throughput for masked-storing 8 32-bit elements; SSE has none; ARM NEON doesn't have any either; and ARM SVE isn't widespread either, incl. not supported on Apple silicon.)

(don't need vcompress, plain masked load/store instrs do exist in AVX-512 and are sufficient)
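For concreteness, here's a minimal (untested) sketch of that shape in C with AVX-512 intrinsics: a plain SIMD hot loop, then one masked iteration for the tail, with the mask built via bzhi. The function name and the "add 1.0f" kernel are placeholders, not anything from the article:

    #include <immintrin.h>
    #include <stddef.h>

    /* Hypothetical kernel: dst[i] = src[i] + 1.0f.
       Requires AVX-512F and BMI2. */
    void add1_f32(float *dst, const float *src, size_t len) {
        const __m512 one = _mm512_set1_ps(1.0f);
        size_t i = 0;
        /* Plain SIMD hot loop, no masking. */
        for (; i + 16 <= len; i += 16) {
            __m512 v = _mm512_loadu_ps(src + i);
            _mm512_storeu_ps(dst + i, _mm512_add_ps(v, one));
        }
        /* One masked iteration for the tail. */
        if (i < len) {
            /* bzhi(-1, left): low (len - i) bits set, moved into a k-register. */
            __mmask16 m = (__mmask16)_bzhi_u32(~0u, (unsigned)(len - i));
            __m512 v = _mm512_maskz_loadu_ps(m, src + i);
            _mm512_mask_storeu_ps(dst + i, m, _mm512_add_ps(v, one));
        }
    }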

dragontamer 15 hours ago | parent [-]

> It takes like 4 instrs to compute the mask from an arbitrary length (AVX-512 doesn't have any instruction for this, so you need to do `bzhi(-1, min(left,vl))` and move that to a mask register), so you'd still likely want to avoid it in the hot loop.

Keep a register with the values IdxAdjustment = [0, 1, 2, 3, 4, 5, 6, 7].

ExecutionMask = (Broadcast(CurIdx) + IdxAdjustment) < Length

Keep looping while any lane index is < Length, which is as simple as "while(exec_mask != 0)".

I'm not seeing this take up any "extra" instructions at all. You needed the while() loop after all. It costs +1 Vector Register (IdxAdjustment) and a kMask by my count.
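A minimal sketch (untested; assumes the length fits in a signed 32-bit int, and the "add 1.0f" kernel is just a placeholder) of that single fully-masked loop with AVX-512 intrinsics:

    #include <immintrin.h>

    /* One fully-masked loop, no separate tail. Requires AVX-512F. */
    void add1_f32_masked(float *dst, const float *src, int len) {
        const __m512i idx_adjust = _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                                     8, 9, 10, 11, 12, 13, 14, 15);
        const __m512i vlen = _mm512_set1_epi32(len);
        const __m512 one = _mm512_set1_ps(1.0f);
        __m512i lane_idx = idx_adjust;           /* Broadcast(CurIdx) + IdxAdjustment, CurIdx = 0 */
        __mmask16 exec_mask = _mm512_cmplt_epi32_mask(lane_idx, vlen);
        int i = 0;
        while (exec_mask != 0) {                 /* while(exec_mask != 0) */
            __m512 v = _mm512_maskz_loadu_ps(exec_mask, src + i);
            _mm512_mask_storeu_ps(dst + i, exec_mask, _mm512_add_ps(v, one));
            i += 16;
            lane_idx = _mm512_add_epi32(lane_idx, _mm512_set1_epi32(16));  /* bump lane indices */
            exec_mask = _mm512_cmplt_epi32_mask(lane_idx, vlen);           /* recompute mask */
        }
    }

Note the vector add and compare per iteration that maintain the mask; that per-loop cost is what the reply below is counting.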

> And this doesn't help pre-AVX-512, and AVX-512 isn't particularly widespread

AVX512 is over 10 years old now. And the premier SIMD execution instruction set is CUDA / NVidia, not AVX512.

AVX512 is now available across AMD's entire CPU lineup and has been for the last two generations (Zen 4 and Zen 5). It is also available on a select number of Intel CPUs. There are also the AMD RDNA and Intel Xe ISAs that could be targeted.

> instrs that do exist are rather slow on AMD (e.g. unconditional 12 cycles/instr throughput for masked-storing 8 32-bit elements);

Okay, I can see that possibly being an issue then.

EDIT: the AMD Zen5 optimization manual states latency 1 and throughput 2 per clock, while Intel's Skylake documentation at https://www.intel.com/content/www/us/en/docs/intrinsics-guid... states latency 5 and throughput 1 per clock.

AMD Zen5 seems to include support for vmovdqu8 (it's in the optimization guide's .xlsx sheet of latencies/throughputs, also listed as latency 1 / throughput 4).

I'm not sure if the "mask" register changes the instruction's timing. I'll do some research to see if I can verify your claim (I don't have my Zen5 computer built yet... but soon).

dzaima 15 hours ago | parent [-]

That's two instrs: bumping the indices, and doing the comparison. You still want scalar pointer/index bumping for contiguous loads/stores (using gathers/scatters for those would be stupid and slow), and that gets you the end check for free via fused cmp+jcc.

And those two instrs are vector instrs, i.e. they compete for the execution units you want for the actual computation, whereas scalar instrs have at least some independent units, which helps avoid the urge toward infinite unrolling.

And if your loop is processing 32-bit (or, worse, smaller) elements, those indices will cost even more if kept as 64-bit, as most code will do.

AVX-512 might be 10 years old, but Intel's latest (!) cores still don't support it on hardware with E-cores, so we're still a decade away from being able to just assume it exists. Another thread on this post mentioned that Intel has shipped hardware without AVX/AVX2/FMA even as late as 2021.

> Okay, I can see that possibly being an issue then.

To be clear, that's only the AVX2 instrs; AVX-512 masked loads/stores are fast. (Yes, even on Zen 4, where the AVX-512 masked loads/stores are fast, the AVX2 ones that do an equivalent amount of work, albeit taking the mask in a different register class, are slow.) uops.info: https://uops.info/table.html?search=maskmovd%20m256&cb_lat=o...

Intel also has AVX-512 masked 512-bit 8-bit-elt stores at half the throughput of unmasked for some reason (not 256-bit or ≥16-bit-elt though; presumably the culprit is the mask having 64 elts): https://uops.info/table.html?search=movdqu8%20m512&cb_lat=on...

And masked loads use some execution ports on both Intel and AMD, eating away at the throughput of the main operation. All in all, the hardware just isn't built for needlessly using masked loads/stores in hot loops.

dragontamer 14 hours ago | parent [-]

Gotcha. Makes sense. Thanks for the discussion!

Overall, I agree that AVX and Neon have their warts and performance issues. But they're like 15+ years old now and designed well before GPU Compute was possible.

> using gathers/scatters for those would be stupid and slow

This is where CPUs are really bad. GPUs will coalesce gather/scatters thanks to __shared__ memory (with human assistance of course).

But also the simplest load/store patterns are auto-detected and coalesced. So a GPU programmer doesn't have to worry about SIMD lane loads/stores (called vgather in AVX512) being slower. It's all optimized to hell and back.

Having a full lane-to-lane crossbar and supporting high performance memory access patterns needs to be a priority moving forward.

dzaima 3 hours ago | parent [-]

Thanks for the info on how things look on the GPU side!

A messy thing with memory performance on CPUs is that either you share the same cache hardware between scalar and vector, thereby significantly limiting how much latency you can trade for throughput, or you have to add a special vector L1 cache, which is a ton of mess and silicon area; never mind uses of SIMD that are latency-sensitive, e.g. SIMD hashmap probing, or small loops.

I guess you don't necessarily need that for just detecting patterns in gather indices, but nothing's gonna get a gather of consecutive 8-bit elts via 64-bit indices to not perform much slower than a single contiguous load, and 8-bit elts are quite important on CPUs for strings & co.