Remix clone Hacker News

	▲	Sesse__ a year ago
		The place where I see this really hurt is when Clang/LLVM gets too fancy, in situations like this:
		- Function A calls function B, which returns some struct S (for instance on the stack).
		- B writes S by individual (small) stores.
		- A wants to copy S from some place to another (e.g. store it in some other struct).
		- LLVM coalesces the individual loads/stores needed to copy S into one or a series of large operations (e.g. 128-bit SSE2 loads+stores).
		- These large loads are issued while the small stores from B are still pending, and necessarily overlap them.
		Boom, store-to-load forwarding failure, and a bad stall. E.g., the Zen series seems to be really bad at this (only tried up to Zen 3), but there are pretty much no out-of-order CPUs that handle this without some kind of penalty.
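A minimal C sketch of the call pattern described in the comment above. It is not taken from the thread. The names `S`, `Container`, `make_s`, and `fill` are hypothetical, and `__attribute__((noinline))` is a GCC/Clang extension used here only to keep B as a real call. Whether a particular compiler version actually merges the copy into wide loads depends on the target and flags, so the generated assembly would need to be checked.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-byte struct; two 128-bit loads would cover it. */
typedef struct {
    uint32_t v[8];
} S;

/* Hypothetical destination that embeds an S. */
typedef struct {
    uint64_t id;
    S payload;
} Container;

/* "Function B": builds S field by field, i.e. with narrow 32-bit stores
 * (unless the compiler decides to merge them itself). */
__attribute__((noinline))
S make_s(uint32_t seed)
{
    S s;
    s.v[0] = seed;
    s.v[1] = seed + 1;
    s.v[2] = seed + 2;
    s.v[3] = seed + 3;
    s.v[4] = seed + 4;
    s.v[5] = seed + 5;
    s.v[6] = seed + 6;
    s.v[7] = seed + 7;
    return s;
}

/* "Function A": receives S from B and immediately copies it elsewhere.
 * The copy is a candidate for wide loads that overlap B's recent
 * narrow stores, which is the situation the comment describes. */
void fill(Container *out, uint64_t id, uint32_t seed)
{
    S tmp = make_s(seed);
    out->id = id;
    memcpy(&out->payload, &tmp, sizeof tmp);
}
```

One way to look for the pattern is to compile with optimizations (for example `clang -O2 -S`) and compare the store widths in `make_s` against the load widths in `fill`.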
	▲	ack_complete a year ago
		This happens with partial autovectorization, too. The compiler fails to vectorize a first loop and then vectorizes the second; the result is a store-forwarding failure at the start of the second loop as it tries to read the output of the first loop, erasing the vectorization gains.
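A small C sketch of the two-loop shape described in the comment above. It is an illustration, not code from the thread. The function name `two_pass` and the arithmetic are hypothetical. The first loop carries a serial dependency, which compilers commonly leave scalar, while the second loop is element-wise and is a likely vectorization candidate. The effect should matter mostly when `n` is small enough that the first loop's stores are still in flight when the second loop starts reading.

```c
#include <stddef.h>
#include <stdint.h>

void two_pass(uint32_t *restrict tmp, const uint32_t *restrict in,
              uint32_t *restrict out, size_t n)
{
    /* Loop 1: each iteration depends on the previous acc, so this is
     * likely to stay scalar and write tmp with 32-bit stores. */
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = acc * 31u + in[i];
        tmp[i] = acc;
    }

    /* Loop 2: independent per element, so the compiler may vectorize it
     * and read tmp with wide loads that span several of loop 1's stores. */
    for (size_t i = 0; i < n; i++) {
        out[i] = tmp[i] * 2u + 1u;
    }
}
```

Whether the second loop is actually vectorized can be checked with the compiler's vectorization reports, for example Clang's `-Rpass=loop-vectorize`.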