Remix.run Logo
camel-cdr 5 days ago

> The answer, if it’s not obvious from my tone already:), is 8%.

Not if the data is small and in cache.

> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.

I think the best way to do this is duplicate sum_r and count 16 times, so each pane has a seperate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction for each of the 16 buckets.

shihab 5 days ago | parent [-]

Yeah N is big enough that entire data isn't in the cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing around 94%+ L1d cache hit rate.