Yes, this can make sense if

- the value is often doesn't require an update, and

- there's contention on the cache line, i.e., at least two cores frequently read or write that cache line.

But there are important details to consider:

1) The probing load must be atomic. Both the compiler and the processor in general are allowed to split non-atomic loads into two or more partial loads. Only atomic loads – even with relaxed ordering – are guaranteed to not return intermediate or mixed values from other atomic stores.

2) If the ordering on the read part of the atomic read-modify-write operation is not relaxed, the probing load must reflect this. For example, an acq-rel RMW op would require an acquire ordering on the probing read.

▲

anematode 5 hours ago | parent [-]

Thanks for your insights. (2) makes sense to me, but for (1), on ARM64 can an aligned 64-bit store really tear in a 64-bit non-atomic load? The spec says "A write that is generated by a store instruction that stores a single general-purpose register and is aligned to the size of the write in the instruction is single-copy atomic" (B2.2.1)

	▲	adwn 3 hours ago \| parent [-]
		> […] on ARM64 […] Well, if you target a specific architecture, then of course you can assume more guarantees than in general, portable code. And in general, a processor might distinguish between non-atomic and relaxed-atomic reads and writes – in theory. But more important, and relevant in practice, is the behavior of the compiler. C, C++, and Rust compilers are allowed to assume that non-atomic reads aren't influenced by concurrent writes, so the compiler is allowed to split non-atomic reads into smaller reads (unlikely) or even optimize the reads away if it can prove that the memory location isn't written to by the local thread (more likely).