I liked the article. I saw your PS that we added it to the working draft for C++26; we also made it part of OpenMP as of 5.0, I think. It's sometimes a hardware atomic, like on ARM, but what made the case for it was that it's common to implement it sub-optimally even on x86 or LL/SC architectures. Often the generic CAS loop gets used, like in your lambda example, but that lacks an early cutout: you can ignore any input value that's on the wrong side of the op, either by doing a cheap atomic read up front or by cutting out of the loop after the first failed CAS if the read-back shows it can't matter. It can also benefit from using slightly different memory orders than the default on architectures like ppc64. It's a surprisingly useful op to support that way. If this kind of thing floats your boat, you might be interested in the non-reading variants of these as well. Mostly for things like add, max, etc., but some recent architectures actually offer alternate operations to skip the read-back. The paper calls them "atomic reduction operations": https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p31...
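To make the contrast concrete, here is a minimal C++ sketch of the two variants described above (the function names are mine; a 64-bit value and relaxed ordering are assumed, the simplest case, and stronger orderings need the extra care discussed further down the thread):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    // Generic CAS loop: performs the RMW even when the stored value is
    // already >= v, so it keeps fighting for the cache line for nothing.
    void fetch_max_naive(std::atomic<uint64_t>& a, uint64_t v) {
        uint64_t cur = a.load(std::memory_order_relaxed);
        while (!a.compare_exchange_weak(cur, std::max(cur, v),
                                        std::memory_order_relaxed)) {
            // cur is refreshed on failure; retry unconditionally
        }
    }

    // With the early cutout: once the observed value is >= v the op cannot
    // change anything, so stop after the cheap read (or after a failed CAS).
    void fetch_max_cutout(std::atomic<uint64_t>& a, uint64_t v) {
        uint64_t cur = a.load(std::memory_order_relaxed);
        while (cur < v &&
               !a.compare_exchange_weak(cur, v, std::memory_order_relaxed)) {
            // cur is refreshed on failure; the cur < v check is the cutout
        }
    }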
anematode 10 hours ago: Curious: even with hardware atomics, wouldn't it be a good idea to first perform a non-atomic load to check whether the store might be necessary (which would require the cache line to be locked), and only run the atomic max if it might change the value?
adgjlsfhk1 8 hours ago: This depends heavily on what concurrency optimizations your processor implements (and unfortunately this is the sort of thing that doesn't get documented and is somewhat hard to test).
anematode 7 hours ago: I did a little unscientific test here on an Apple M4 Pro with n threads spamming atomic operations with pseudorandom values on one memory location (the worst case). Used inline asm to make sure there was no funny business going on.

    atomic adds
      n = 1 -> 333e6 adds/second
      n = 2 -> 174e6
      n = 4 -> 95e6
      n = 8 -> 63e6
    atomic maxs
      n = 1 -> 161e6 maxs/second
      n = 2 -> 59e6
      n = 4 -> 39e6
      n = 8 -> 27e6
    atomic maxs with preceding check
      n = 1 -> 929e6 maxs/second
      n = 2 -> 1541e6
      n = 4 -> 3494e6
      n = 8 -> 5985e6
So evidently the M4 doesn't do this optimization. Of course, if your distribution is different you'd get different results, and this level of contention is unrealistic, but I don't see why you'd EVER not do a check before running atomic max. I also find it interesting that atomic max is significantly slower than atomic add.
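For reference, a rough portable sketch of the kind of micro-benchmark being described (hypothetical: the original used inline asm so the hardware ldumax is issued directly, whereas this uses a CAS-loop stand-in, so the absolute numbers will differ):

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <thread>
    #include <vector>

    std::atomic<uint64_t> shared{0};
    constexpr int      n_threads = 8;
    constexpr uint64_t iters     = 10'000'000;

    // CAS-loop stand-in for the hardware atomic max (the real test used ldumax).
    void atomic_max(std::atomic<uint64_t>& a, uint64_t v) {
        uint64_t cur = a.load(std::memory_order_relaxed);
        while (cur < v &&
               !a.compare_exchange_weak(cur, v, std::memory_order_relaxed)) {}
    }

    int main() {
        std::vector<std::thread> threads;
        auto t0 = std::chrono::steady_clock::now();
        for (int t = 0; t < n_threads; ++t) {
            threads.emplace_back([t] {
                std::mt19937_64 rng(t);          // pseudorandom values per thread
                for (uint64_t i = 0; i < iters; ++i) {
                    uint64_t v = rng();
                    // the "preceding check": skip the RMW if it cannot win
                    if (shared.load(std::memory_order_relaxed) < v)
                        atomic_max(shared, v);
                }
            });
        }
        for (auto& th : threads) th.join();
        double dt = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();
        std::printf("%.0f maxs/second\n", n_threads * iters / dt);
    }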
thequux 7 hours ago: I think this can change the semantics, though: with the preceding check you can miss the shared variable being decremented by another thread. In some cases, such as when the shared value is known to be monotonic, this is done, but not in the general case.
anematode 6 hours ago: With a relaxed ordering I'm not sure that's right, since the ldumax would have no imposed ordering relation with the (atomic) decrement on another thread, and so could very well have operated on the old value obtained by the non-atomic load.
gpderetta 4 hours ago: All operations on a single memory location are always totally ordered in a cache-coherent system, no matter how relaxed the memory model is. Also, am I understanding correctly that n is the number of threads in your example? Don't you find it suspicious that the number of operations goes up as the thread count goes up?

edit: ok, you are saying that under heavy contention the check avoids having to do the store at all. This is racy, and whether it is correct or not would be very application-specific.

edit2: I thought about this a bit, and I'm not sure I can come up with a scenario where the race matters...

edit3: ... as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.
Dylan16807 2 hours ago: > as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.

What assumes that? If your early read gives you a higher number, quitting out immediately is the same as doing the max at that same nanosecond: you avoid setting a variable to the value it already holds. Doing or not doing that write shouldn't affect other atomics users, should it? In general, I should be able to add or remove as many atomic(x = x) operations as I want without changing the result, right?

And if your early read is lower, then you just do the max, and the extra read is harmless. The only case I can think of that goes wrong is the read (and the elided max) happening too early relative to accesses to other variables, but we're assuming relaxed memory order here, so that's explicitly acceptable.
gpderetta an hour ago: Yes, you are probably right: a load that finds a larger value is equivalent to a max. As the max wouldn't store any value in that case, it also wouldn't introduce any synchronization edge. A load that finds a smaller value is trickier to analyze, but I think you are free to just ignore it and proceed with the atomic max; an underlying LL/SC loop implementing a max operation might spuriously fail anyway.

edit: here is another argument in favour: if your only atomic RMW is a CAS, then to implement X.atomic_max(new) you would do:

    1: expected <- X
    2: if new < expected: done
    3: else if X.cas(expected, new): done
       else: goto 2   # expected implicitly refreshed by the failed CAS

So a CAS loop would naturally implement the same optimization (unless it starts with a random expected), so the race is benign.
ibraheemdev 4 hours ago: It does make a difference, of course: if you're running fetch_max from multiple threads, adding a load fast-path introduces a race condition.
masklinn 3 hours ago: Does it tho? Assuming no torn reads/writes at those sizes, and given that the location should be strictly increasing, are there situations where you could read a higher-than-stored value which would cause skipping a necessary update? Afaik on all of x86, arm, and riscv an atomic load of a word-sized datum is just a regular load.
gpderetta an hour ago: It doesn't need to be strictly increasing: some other thread could be making arbitrary other operations. Still, even in that case, as Dylan16807 pointed out, it likely doesn't matter.
adwn 6 hours ago: Yes, this can make sense if

- the value often doesn't require an update, and
- there's contention on the cache line, i.e., at least two cores frequently read or write that cache line.

But there are important details to consider:

1) The probing load must be atomic. Both the compiler and the processor are, in general, allowed to split non-atomic loads into two or more partial loads. Only atomic loads – even with relaxed ordering – are guaranteed not to return intermediate or mixed values from other atomic stores.

2) If the ordering on the read part of the atomic read-modify-write operation is not relaxed, the probing load must reflect this. For example, an acq-rel RMW op would require acquire ordering on the probing read.
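As a sketch of both points (the helper names are hypothetical; the max itself is written as a CAS loop here since std::atomic::fetch_max only arrives with C++26 / P0493):

    #include <atomic>
    #include <cstdint>

    // (1) The probe is an *atomic* relaxed load, never a plain read.
    void max_if_needed_relaxed(std::atomic<uint64_t>& a, uint64_t v) {
        if (a.load(std::memory_order_relaxed) >= v)
            return;                                  // RMW cannot change the value
        uint64_t cur = a.load(std::memory_order_relaxed);
        while (cur < v &&
               !a.compare_exchange_weak(cur, v, std::memory_order_relaxed)) {}
    }

    // (2) For an acq-rel max, the probe has to carry acquire ordering too,
    //     so the early-return path still synchronizes like the read side
    //     of the RMW would have.
    void max_if_needed_acq_rel(std::atomic<uint64_t>& a, uint64_t v) {
        if (a.load(std::memory_order_acquire) >= v)
            return;
        uint64_t cur = a.load(std::memory_order_relaxed);
        while (cur < v &&
               !a.compare_exchange_weak(cur, v,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire)) {}
    }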
anematode 5 hours ago: Thanks for your insights. (2) makes sense to me, but for (1), on ARM64 can an aligned 64-bit store really tear in a 64-bit non-atomic load? The spec says "A write that is generated by a store instruction that stores a single general-purpose register and is aligned to the size of the write in the instruction is single-copy atomic" (B2.2.1).
adwn 3 hours ago: > […] on ARM64 […]

Well, if you target a specific architecture, then of course you can assume more guarantees than in general, portable code. And in general, a processor might distinguish between non-atomic and relaxed-atomic reads and writes – in theory. But more important, and relevant in practice, is the behavior of the compiler: C, C++, and Rust compilers are allowed to assume that non-atomic reads aren't influenced by concurrent writes, so the compiler is allowed to split a non-atomic read into smaller reads (unlikely) or even optimize the read away entirely if it can prove that the memory location isn't written to by the local thread (more likely).
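A toy illustration of that compiler-side hazard (hypothetical names): with a plain variable the compiler may assume there is no concurrent writer and read it only once, while a relaxed atomic load has to actually be issued each time:

    #include <atomic>
    #include <cstdint>

    uint64_t plain_counter = 0;                  // non-atomic
    std::atomic<uint64_t> atomic_counter{0};     // atomic; relaxed is enough

    // The compiler may hoist the read of plain_counter out of the loop and
    // turn this into an empty or infinite loop, since it can assume the local
    // thread is the only writer.
    void wait_plain(uint64_t v) {
        while (plain_counter < v) { /* spin */ }
    }

    // The relaxed atomic load must be performed on every iteration.
    void wait_atomic(uint64_t v) {
        while (atomic_counter.load(std::memory_order_relaxed) < v) { /* spin */ }
    }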
SkiFire13 6 hours ago: > but it lacks an early cutout since you can ignore any input value that's on the wrong side of the op by doing a cheap atomic read or just cutting out of the loop after the first failed CAS if the read back shows it can't matter

I believe this is a bit trickier than that: you would also need at least some kind of atomic barrier to preserve the ordering semantics of the successful-update case.