This comment is a two-sentence summary of the six-sentence abstract at the very top of the linked article. (Though the paper claims 9%, not 10%, and gives the figure to three sig figs, so rounding up to 10% is inappropriate.)

Also, 9% is huge! I am kind of skeptical of this result (I haven't read the paper yet). E.g., is it possible that the TSO mode on ARM isn't optimal, so that it performs relatively worse than a TSO-native platform like x86?

> An application can benefit from weak MCMs if it distributes its workload across multiple threads which then access the same memory. Less-optimal access patterns might result in heavy cache-line bouncing between cores. In a weak MCM, cores can reschedule their instructions more effectively to hide cache misses while stronger MCMs might have to stall more frequently.

So to some extent, this is avoidable overhead with better design (reduced mutable sharing between threads). The impact of TSO vs WO is greater for programs with more sharing.
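
A rough sketch of what "reduced mutable sharing" can look like, with a made-up helper name (`parallel_sum`) and nothing taken from the paper: each worker accumulates into a private local and touches shared memory once at the end, so there is little for the cores to fight over under either memory model.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_threads` workers.
long long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    // Ceiling division so every element falls into some worker's chunk.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, t, chunk] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            long long local = 0;  // private to this thread
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;   // the only write to shared memory
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The contrast would be a loop that does `partial[t] += data[i]` on every iteration, which keeps writing to neighboring slots that likely share a cache line.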

> The 644.nab_s benchmark consists of parallel floating point calculations for molecular modeling. ... If not properly aligned, two cores still share the same cache-line as these chunks span over two instead of one cache-line. As shown in Fig. 5, the consequence is an enormous cache-line pressure where one cache-line is permanently bouncing between two cores. This high pressure can enforce stalls on architectures with stronger MCMs like TSO, that wait until a core can exclusively claim a cache-line for writing, while weaker memory models are able to reschedule instructions more effectively. Consequently, 644.nab_s performs 24 percent better under WO compared to TSO.

Yeah, ok, so the huge magnitude observed is due to some really poor program design.
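
For illustration, the usual fix for this kind of layout problem looks something like the following. This is my own sketch, not code from the benchmark; the struct names are made up and the 64-byte line size is an assumption (some cores, reportedly including Apple's, use larger lines).

```cpp
#include <cstddef>

// Assumed cache-line size in bytes; adjust for the target hardware.
constexpr std::size_t kCacheLine = 64;

// Packed: in an array, several threads' slots land in the same cache line.
struct PackedSlot {
    double sum;
};

// Padded: alignas raises both the alignment and sizeof to kCacheLine, so
// each element of an array of PaddedSlot starts on its own line.
struct alignas(kCacheLine) PaddedSlot {
    double sum;
};

static_assert(sizeof(PackedSlot) == sizeof(double), "no padding expected");
static_assert(sizeof(PaddedSlot) == kCacheLine, "one slot per cache line");
static_assert(alignof(PaddedSlot) == kCacheLine, "slot starts on a line boundary");
```

With `PaddedSlot slots[num_threads]`, thread `t` writing `slots[t].sum` no longer invalidates its neighbor's line, at the cost of some wasted memory per slot.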

> The primary performance advantage applications might gain from running under weaker memory ordering models like WO is due to greater instruction reordering capabilities. Therefore, the performance benefit vanishes if the hardware architecture cannot sufficiently reorder the instructions (e.g., due to data dependencies).
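
To make the "data dependencies" point concrete, here is a toy contrast (my own sketch with hypothetical function names, not something from the paper). In the first loop every load address is known up front; in the second, each load's address comes from the previous load, so there is little for the core to reorder no matter how weak the memory model is.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent loads: addresses do not depend on earlier loads, so an
// out-of-order core can overlap several cache misses.
long long sum_by_index(const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) total += values[i];
    return total;
}

// Dependent loads (pointer chasing): the next index is read from memory,
// so each step has to wait for the previous load to complete.
// `next` is assumed to hold valid indices into `values`.
long long sum_by_chase(const std::vector<int>& values,
                       const std::vector<std::size_t>& next,
                       std::size_t start, std::size_t steps) {
    assert(values.size() == next.size());
    long long total = 0;
    std::size_t i = start;
    for (std::size_t s = 0; s < steps; ++s) {
        assert(i < values.size());
        total += values[i];
        i = next[i];  // address of the next load depends on this load
    }
    return total;
}
```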

I've now read the thing all the way through. It's interesting and maybe useful for thinking about WO vs TSO mode on Apple M1 Ultra chips specifically, but I don't know how much it generalizes.

gpderetta 2 days ago

My understanding is that x86 implementations use speculation to be able to reorder beyond what's allowed by the memory model. This is not free in area and power, but allows recovering some of the cost of the stronger memory model.

As TSO support is only a transitional aid for Apple, it is possible that they didn't bother to implement the full extent of the optimizations possible.

Someone a day ago

Or chose not to fully implement it. Speculative execution has its share of security issues, so they may have chosen to be cautious.

	adgjlsfhk1 a day ago
		based on the value speculation they do, side channel security doesn't seem to have been one of the primary goals

ip26 2 days ago

I’m not an expert… but it seems like it could be even simpler than program design. They note that false sharing occurs because data isn't cache-line aligned. Yet when compiling for ARM, that’s not a big deal due to WO. When targeting x86, you would hope the compiler would work hard to align the data! So the out-of-the-box compiler behavior could be crucial. Are there extra flags that should be used when targeting ARM-TSO?

	loeg 2 days ago
		False sharing mostly needs to be avoided with program design. I'm not aware of any compiler flags that help here.