dragontamer | 14 hours ago
Gotcha. Makes sense. Thanks for the discussion!

Overall, I agree that AVX and Neon have their warts and performance issues. But they're 15+ years old now and were designed well before GPU compute was possible.

> using gathers/scatters for those would be stupid and slow

This is where CPUs are really bad. GPUs will coalesce gathers/scatters thanks to __shared__ memory (with human assistance, of course), and the simplest load/store patterns are auto-detected and coalesced. So a GPU programmer doesn't have to worry about SIMD per-lane loads/stores (called vgather in AVX-512) being slow; it's all optimized to hell and back.

Having a full lane-to-lane crossbar and supporting high-performance memory access patterns needs to be a priority moving forward.
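Roughly the kind of thing I mean, as a minimal CUDA sketch (names are made up; it assumes a 256-thread block and that the permutation stays inside the block's tile):

    // Sketch only: stage a permuted read through __shared__ memory so the
    // global-memory traffic stays coalesced.
    __global__ void permuted_copy(const float* __restrict__ in,
                                  float*       __restrict__ out,
                                  const int*   __restrict__ perm,
                                  int n)
    {
        __shared__ float tile[256];              // assumes blockDim.x == 256
        int base = blockIdx.x * blockDim.x;
        int i    = base + threadIdx.x;

        if (i < n) tile[threadIdx.x] = in[i];    // contiguous -> one coalesced load
        __syncthreads();

        if (i < n) {
            int src = perm[i];                   // arbitrary index; assumed to fall
                                                 // inside this block's tile
            out[i] = tile[src - base];           // the "gather" is served from fast
        }                                        // shared memory, not DRAM
    }

The only non-contiguous access ends up hitting shared memory; everything touching global memory is a plain coalesced load/store.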
dzaima | 2 hours ago | parent
Thanks for the info on how things look on the GPU side!

A messy thing with memory performance on CPUs is that either you share the same cache hardware between scalar and vector code, which significantly limits how much latency you can trade for throughput, or you add a separate vector L1 cache, which is a ton of complexity and silicon area; never mind the uses of SIMD that are latency-sensitive, e.g. SIMD hashmap probing or small loops.

I guess you don't necessarily need that just for detecting patterns in gather indices, but nothing is going to make a gather of consecutive 8-bit elements via 64-bit indices perform anywhere near a single contiguous load, and 8-bit elements are quite important on CPUs for strings & co.
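To make the 8-bit case concrete, a rough sketch (mine; the index array and function names are made up). The contiguous version is one 64-byte load; the index-driven version has no byte gather to lean on in AVX-512, so it degenerates into scalar loads (or 32-bit gathers plus shuffles), and the 64-bit indices alone need eight vector registers' worth of data just to describe 64 byte positions:

    #include <immintrin.h>
    #include <cstdint>

    // 64 consecutive bytes, loaded directly: one instruction.
    __m512i load_contiguous(const uint8_t* src) {
        return _mm512_loadu_si512(src);
    }

    // The same bytes reached through 64-bit indices: AVX-512 has no byte
    // gather, so this ends up as 64 scalar loads (or wider gathers plus
    // shuffles) even though idx[i] happens to equal i -- the CPU can't
    // assume that.
    __m512i load_via_indices(const uint8_t* src, const uint64_t* idx) {
        alignas(64) uint8_t tmp[64];
        for (int i = 0; i < 64; ++i)
            tmp[i] = src[idx[i]];
        return _mm512_loadu_si512(tmp);
    }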