camel-cdr 5 days ago
> The answer, if it’s not obvious from my tone already:), is 8%.

Not if the data is small and in cache.

> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.

I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has a separate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction for each of the 16 buckets.
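To make the idea concrete, here is a minimal scalar sketch in C, not the article's code. It assumes the loop being discussed has the shape `sum_r[idx[i]] += values[i]; count[idx[i]]++`. The function name `bucket_sums`, the parameters `values`, `idx` and `k`, and the `MAX_K` bound are all hypothetical. The inner lane loop stands in for one 16-wide AVX-512 gather/add/scatter: because lane `l` only ever touches column `l` of each bucket, two lanes can never write the same address, so no vpconflictd is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* a 512-bit vector holds 16 32-bit lanes */
#define MAX_K 64  /* hypothetical upper bound on the number of buckets */

/* Sums values[i] into bucket idx[i] (assumes every idx[i] < k),
 * keeping one private copy of each bucket per lane. */
void bucket_sums(const float *values, const uint32_t *idx, size_t n,
                 size_t k, float *sum_out, uint32_t *count_out)
{
    assert(k <= MAX_K);

    /* sum_r and count duplicated 16 times: [bucket][lane] */
    float sum_r[MAX_K][LANES] = {{0}};
    uint32_t count[MAX_K][LANES] = {{0}};

    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        /* Stand-in for one vector iteration: lane l only touches
         * column l, so all 16 updates hit distinct addresses. */
        for (size_t lane = 0; lane < LANES; lane++) {
            uint32_t b = idx[i + lane];
            sum_r[b][lane] += values[i + lane];
            count[b][lane] += 1;
        }
    }

    /* Tail: fewer than LANES elements left, one per lane. */
    for (size_t lane = 0; i < n; i++, lane++) {
        uint32_t b = idx[i];
        sum_r[b][lane] += values[i];
        count[b][lane] += 1;
    }

    /* After the loop: reduce the 16 per-lane copies of each bucket. */
    for (size_t b = 0; b < k; b++) {
        float s = 0.0f;
        uint32_t c = 0;
        for (size_t lane = 0; lane < LANES; lane++) {
            s += sum_r[b][lane];
            c += count[b][lane];
        }
        sum_out[b] = s;
        count_out[b] = c;
    }
}
```

The cost is 16x the bucket storage plus the final reduction, which is only cheap while the number of buckets stays small. Note that the floating-point sums are added in a different order than in a plain scalar loop, so results can differ in the last bits.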
shihab 5 days ago
Yeah, N is big enough that the entire data set doesn't fit in the cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing around a 94%+ L1d cache hit rate.