	▲	camel-cdr 5 days ago
		> The answer, if it’s not obvious from my tone already :), is 8%.

		Not if the data is small and in cache.

		> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.

		I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has a separate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction over the 16 buckets.
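A minimal sketch of the per-lane bucket idea from the comment above. It is written in portable scalar C, not with AVX-512 intrinsics, and it is not the article's code. The function name `grouped_sum_count`, the `MAX_GROUPS` cap, and the assumption that the keys are small group indices are all hypothetical. The point is only the layout: lane `l` writes exclusively to column `l`, so two equal keys in the same 16-element block never hit the same slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16      /* 16 x 32-bit lanes in one 512-bit register */
#define MAX_GROUPS 64 /* hypothetical cap on the number of distinct keys */

/* Hypothetical grouped aggregation: per key, sum the values and count rows. */
void grouped_sum_count(const uint32_t *keys, const float *vals, size_t n,
                       uint32_t num_groups, float *sum_out, uint32_t *count_out)
{
    /* One private bucket per (group, lane) pair, all starting at zero. */
    float sum_lane[MAX_GROUPS][LANES] = {{0}};
    uint32_t count_lane[MAX_GROUPS][LANES] = {{0}};

    assert(num_groups <= MAX_GROUPS);

    for (size_t i = 0; i < n; i++) {
        size_t lane = i % LANES; /* position of element i within its block */
        uint32_t k = keys[i];
        assert(k < num_groups);
        /* In a vectorized version, 16 consecutive iterations would become one
         * gather/add/scatter on indices k * LANES + lane, which are distinct
         * across lanes even when the keys repeat. */
        sum_lane[k][lane] += vals[i];
        count_lane[k][lane] += 1;
    }

    /* After the loop: reduce the 16 per-lane buckets of each group. */
    for (uint32_t g = 0; g < num_groups; g++) {
        float s = 0.0f;
        uint32_t c = 0;
        for (int lane = 0; lane < LANES; lane++) {
            s += sum_lane[g][lane];
            c += count_lane[g][lane];
        }
        sum_out[g] = s;
        count_out[g] = c;
    }
}
```

The trade-off is 16x the accumulator memory, so this presumably only pays off while the buckets still fit in L1. The float sums are also added in a different order than in a plain sequential loop, so results can differ in the last bits.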
	▲	shihab 5 days ago
		Yeah, N is big enough that the entire dataset doesn't fit in cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing around 94%+ L1d cache hit rate.