With a relaxed ordering, I'm not sure that's right: the ldumax would have no imposed ordering relation with the (atomic) decrement on another thread, and so could very well have operated on the old value obtained by the non-atomic load.

▲ gpderetta 4 hours ago | parent | next [-]

All operations on a single memory location are always totally ordered in a cache-coherent (CC) system, no matter how relaxed the memory model is.

Also am I understanding it correctly that n is the number of threads in your example? Don't you find it suspicious that the number of operations goes up as the thread count goes up?

edit: ok, you are saying that under heavy contention the check avoids having to do the store at all. This is racy, and whether it is correct would be very application specific.

edit2: I thought about this a bit, and I'm not sure I can come up with a scenario where the race matters...

edit3: ... as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.

▲ Dylan16807 2 hours ago | parent [-]

> as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.

What assumes that?

If your early read gives you a higher number, quitting out immediately is the same as doing the max that same nanosecond. You avoid setting a variable to the same value it already is. Doing or not doing that write shouldn't affect other atomics users, should it?

In general, I should be able to add or remove as many atomic(x=x) operations as I want without changing the result, right?

And if your early read is lower, then you just do the max, and having an extra read is harmless.

The only case I can think of that goes wrong is the read (and elided max) happening too early in relation to accesses to other variables, but we're assuming relaxed memory order here so that's explicitly acceptable.
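
For concreteness, here is a rough Rust sketch of the fast path being discussed. The names `max_with_fast_path` and `race_demo` are made up for illustration, and it assumes every access uses relaxed ordering, as in the rest of this thread:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical helper: atomic max with a relaxed load as a fast path.
/// Returns the value observed before the (possibly elided) update.
fn max_with_fast_path(x: &AtomicU64, new: u64) -> u64 {
    // Early read: if the stored value is already >= new, the max would only
    // rewrite the same value, so quit without touching the location.
    let seen = x.load(Ordering::Relaxed);
    if seen >= new {
        return seen;
    }
    // Early read was lower: do the real RMW. The extra read was harmless.
    x.fetch_max(new, Ordering::Relaxed)
}

/// Four threads race to publish values; only max operations touch `x`.
fn race_demo() -> u64 {
    let x = AtomicU64::new(0);
    thread::scope(|s| {
        for t in 0..4u64 {
            let x = &x;
            s.spawn(move || {
                for i in 0..1000u64 {
                    max_with_fast_path(x, i * 4 + t);
                }
            });
        }
    });
    // All threads have been joined here, so this load sees the final value.
    x.load(Ordering::Relaxed)
}
```

In this sketch the largest value any thread offers is 3999, and that is what ends up stored regardless of how the early reads interleave.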

▲ gpderetta an hour ago | parent [-]

Yes, you are probably right: a load that finds a larger value is equivalent to a max. Since the max wouldn't store any value in this case, it also wouldn't introduce any synchronization edge. A load that finds a smaller value is trickier to analyze, but I think you are free to ignore it and just proceed with the atomic max. An underlying LL/SC loop implementing the max operation might spuriously fail anyway.

edit: here is another argument in favour: if your only atomic RMW is a cas, to implement X.atomic_max(new) you would:

`1: expected <- X`
`2: if new < expected: done`
`3: else if X.cas(expected, new): done else goto 2 # expected implicitly refreshed`

So a cas loop would naturally implement the same optimization (unless it starts with a random expected), so the race is benign.
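
For what it's worth, the cas loop could look roughly like this in Rust. This is only a sketch: `cas_max` is a hypothetical name, relaxed ordering is assumed throughout, and `compare_exchange_weak` stands in for the generic `cas` (it is allowed to fail spuriously, much like an LL/SC loop):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: atomic max built only from compare-and-swap.
/// Returns the value that was stored before this call took effect.
fn cas_max(x: &AtomicU64, new: u64) -> u64 {
    // Step 1: seed `expected` with a plain relaxed load.
    let mut expected = x.load(Ordering::Relaxed);
    loop {
        // Step 2: a larger value is already stored, so there is nothing to write.
        if new < expected {
            return expected;
        }
        // Step 3: try to install `new`; on failure `expected` is refreshed
        // with the current value and we go back to step 2.
        match x.compare_exchange_weak(expected, new, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => return prev,
            Err(actual) => expected = actual,
        }
    }
}
```

The early exit in step 2 is the same load-then-skip shortcut as the fast path, just folded into the retry loop.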

▲ ibraheemdev 4 hours ago | parent | prev [-]

It does make a difference, of course: if you're running fetch_max from multiple threads, adding a load fast-path introduces a race condition.

▲ masklinn 3 hours ago | parent | next [-]

Does it tho? Assuming no torn reads/writes at those sizes, and given that the location should be strictly increasing, are there situations where you could read a higher-than-stored value, which would cause a necessary update to be skipped?

AFAIK on all of x86, ARM, and RISC-V, an atomic load of a word-sized datum is just a regular load.

▲ gpderetta an hour ago | parent [-]

It doesn't need to be strictly increasing: some other thread could be performing other arbitrary operations. Still, even in that case, as Dylan16807 pointed out, it likely doesn't matter.

▲ 4 hours ago | parent | prev [-]

[deleted]