dzaima 17 hours ago

That's two instrs - bumping the indices, and doing the comparison. You still want scalar pointer/index bumping for contiguous loads/stores (using gathers/scatters for those would be stupid and slow), and that gets you the end check for free via fused cmp+jcc.
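
Roughly what I mean, as a sketch (C with AVX2 intrinsics; add1_f32 is a made-up example kernel):

    #include <immintrin.h>
    #include <stddef.h>

    /* Contiguous loop with scalar index bumping: the i += 8 add and the
       i + 8 <= n compare are the only bookkeeping, and the compare+branch
       typically macro-fuse, so the end check is essentially free. */
    void add1_f32(float *dst, const float *src, size_t n) {
        size_t i = 0;
        const __m256 one = _mm256_set1_ps(1.0f);
        for (; i + 8 <= n; i += 8)
            _mm256_storeu_ps(dst + i, _mm256_add_ps(_mm256_loadu_ps(src + i), one));
        for (; i < n; i++)   /* scalar tail */
            dst[i] = src[i] + 1.0f;
    }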

And those two instrs are vector instrs, i.e. competing for the execution units you actually want for the computation, whereas scalar instrs have at least some execution units of their own, which avoids needing near-infinite unrolling to hide the loop overhead.

And if your loop processes 32-bit (or, worse, smaller) elements, those indices - kept as 64-bit, as most code will do - cost even more.
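
One way to see the extra bookkeeping (AVX2 intrinsics; bump_indices is a made-up helper): one 8 x 32-bit data vector needs two 4 x 64-bit index vectors, so the per-iteration index work doubles:

    #include <immintrin.h>

    /* If the loop's lane indices live in vectors as 64-bit values while the
       data is 32-bit, one 8 x 32-bit data vector drags along two 4 x 64-bit
       index vectors, and both have to be bumped every iteration (sketch). */
    void bump_indices(__m256i *idx_lo, __m256i *idx_hi) {
        const __m256i step = _mm256_set1_epi64x(8);  /* advance by 8 elements */
        *idx_lo = _mm256_add_epi64(*idx_lo, step);
        *idx_hi = _mm256_add_epi64(*idx_hi, step);
    }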

AVX-512 might be 10 years old, but Intel's latest (!) cores still don't support it on hardware with E-cores, so we're still a decade away from being able to just assume it exists. Another thread on this post even mentions that Intel shipped hardware without AVX/AVX2/FMA as late as 2021.

> Okay, I can see that possibly being an issue then.

To be clear, that's only the AVX2 instrs; AVX-512 masked loads/stores are fast. (Yes, even on Zen 4: the AVX-512 masked loads/stores are fast there, while the AVX2 ones that do an equivalent amount of work - albeit taking the mask in a different register class - are slow.) uops.info: https://uops.info/table.html?search=maskmovd%20m256&cb_lat=o...
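
For concreteness, the pair being compared looks something like this (C intrinsics sketch; function names are mine):

    #include <immintrin.h>

    /* AVX2 masked store: mask lives in a vector register (vpmaskmovd). */
    void store_avx2(int *p, __m256i mask, __m256i v) {
        _mm256_maskstore_epi32(p, mask, v);
    }

    /* AVX-512VL masked store: same work, mask lives in a k register
       (vmovdqu32 with a {k} write-mask). */
    void store_avx512vl(int *p, __mmask8 k, __m256i v) {
        _mm256_mask_storeu_epi32(p, k, v);
    }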

Intel also runs AVX-512 masked 512-bit stores of 8-bit elements at half the throughput of unmasked ones for some reason (not the 256-bit or ≥16-bit-elt variants though; presumably the culprit is the mask having 64 elements): https://uops.info/table.html?search=movdqu8%20m512&cb_lat=on...
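
(That's roughly this instruction; a sketch mapping the uops.info entry to the intrinsic:)

    #include <immintrin.h>
    #include <stdint.h>

    /* vmovdqu8 [mem]{k}, zmm - a masked 512-bit store of 8-bit elements,
       so the write-mask has 64 lanes. */
    void masked_store_bytes(uint8_t *p, __mmask64 k, __m512i v) {
        _mm512_mask_storeu_epi8(p, k, v);
    }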

And masked loads consume execution ports on both Intel and AMD, eating into the throughput of the main operation. All in all, the hardware just isn't built for needlessly using masked loads/stores in hot loops.
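
i.e. the sane shape is unmasked loads/stores in the hot loop, with at most one masked iteration for the tail. A sketch (AVX-512VL intrinsics; add1_i32 is a made-up example):

    #include <immintrin.h>
    #include <stddef.h>

    void add1_i32(int *dst, const int *src, size_t n) {
        size_t i = 0;
        const __m256i one = _mm256_set1_epi32(1);
        /* hot loop: plain unmasked loads/stores */
        for (; i + 8 <= n; i += 8) {
            __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
            _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(v, one));
        }
        /* masked ops only for the final n % 8 elements */
        if (i < n) {
            __mmask8 k = (__mmask8)((1u << (n - i)) - 1);
            __m256i v = _mm256_maskz_loadu_epi32(k, src + i);
            _mm256_mask_storeu_epi32(dst + i, k, _mm256_add_epi32(v, one));
        }
    }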

dragontamer 17 hours ago | parent [-]

Gotcha. Makes sense. Thanks for the discussion!

Overall, I agree that AVX and Neon have their warts and performance issues. But they're like 15+ years old now and designed well before GPU Compute was possible.

> using gathers/scatters for those would be stupid and slow

This is where CPUs are really bad. GPUs will coalesce gather/scatters thanks to __shared__ memory (with human assistance of course).

But also, the simplest load/store patterns are auto-detected and coalesced, so a GPU programmer doesn't have to worry about per-lane SIMD loads/stores (vgather in AVX2/AVX-512 terms) being slower. It's all optimized to hell and back.
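
(On the CPU side, the pattern in question looks like this - a sketch, AVX2 intrinsics, names mine. A GPU coalesces the gather pattern automatically; current x86 gathers still execute it element by element:)

    #include <immintrin.h>

    /* Every lane reads base + lane: a GPU coalesces this into one
       transaction, but the x86 gather below is executed element by
       element - nothing detects the contiguous load hiding underneath. */
    __m256i load_via_gather(const int *base) {
        const __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm256_i32gather_epi32(base, idx, 4);       /* slow */
    }

    __m256i load_contiguous(const int *base) {
        return _mm256_loadu_si256((const __m256i *)base);  /* fast */
    }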

Having a full lane-to-lane crossbar and supporting high performance memory access patterns needs to be a priority moving forward.

dzaima 5 hours ago | parent [-]

Thanks for the info on how things look on the GPU side!

A messy thing with memory performance on CPUs is that you either share the same cache hardware between scalar and vector, which significantly limits how much latency you can trade for throughput, or you add a separate vector L1 cache, which is a ton of mess and silicon area - never mind the uses of SIMD that are latency-sensitive, e.g. SIMD hashmap probing or small loops.
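
(For the latency-sensitive case, think of something like a Swiss-table probe - the SIMD load sits right on the lookup's critical path. A sketch with SSE2 intrinsics, names mine:)

    #include <emmintrin.h>
    #include <stdint.h>

    /* Compare 16 control bytes against the hash tag in one shot; the
       result is a bitmask of candidate slots. The load's latency feeds
       straight into how fast the lookup resolves. */
    static inline uint32_t probe_group(const uint8_t *ctrl, uint8_t tag) {
        __m128i group = _mm_loadu_si128((const __m128i *)ctrl);
        __m128i match = _mm_cmpeq_epi8(group, _mm_set1_epi8((char)tag));
        return (uint32_t)_mm_movemask_epi8(match);
    }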

I guess you don't necessarily need that just for detecting patterns in gather indices, but nothing is going to make a gather of consecutive 8-bit elts via 64-bit indices perform anywhere near a single contiguous load, and 8-bit elts are quite important on CPUs for strings & co.
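
To make that concrete (a sketch; x86 doesn't even have a byte gather, so the "gather" formulation has to fetch dwords via 64-bit indices and narrow them, and still only produces a fraction of what the single load gets):

    #include <immintrin.h>
    #include <stdint.h>

    /* 32 consecutive bytes the contiguous way: one load. */
    __m256i bytes_contiguous(const uint8_t *p) {
        return _mm256_loadu_si256((const __m256i *)p);
    }

    /* The same data the "gather via 64-bit indices" way: eight 64-bit
       indices fetch eight bytes as dwords, which then get narrowed - and
       this covers only 8 of the 32 bytes above (ignoring the over-read of
       3 bytes past each element). */
    __m128i eight_bytes_via_gather(const uint8_t *p, __m512i idx) {
        __m256i dwords = _mm512_i64gather_epi32(idx, p, 1);
        return _mm256_cvtepi32_epi8(dwords);
    }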