loeg 5 days ago
Agner Fog's manual says "A LOCK prefix typically costs more than a hundred clock cycles," which might be dated but is directionally correct. (The atomic version of the instruction is LOCK ADD.) If you go to the CPU-specific tables, LOCK ADD has a latency of roughly 8-55 cycles on AMD (Zen 3: 8, Zen 2: 20, Bulldozer: 55, lol) vs. the expected 1 cycle for a regular ADD, and about 10 cycles on Intel CPUs. So it can be starkly slower on some older AMD platforms, and merely ~10x slower on modern x86 platforms.
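To make the ADD vs. LOCK ADD distinction concrete, here is a minimal Rust sketch. The function names `bump_plain` and `bump_atomic` are made up for illustration. The assumption is that on x86-64 an optimizing compiler typically lowers the first to a plain ADD and the second to a LOCK-prefixed instruction (LOCK ADD, or LOCK XADD when the old value is used); that is typical codegen, not something the language guarantees.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Non-atomic increment (hypothetical helper).
/// Assumption: on x86-64 this usually becomes an ordinary ADD.
pub fn bump_plain(counter: &mut u64) {
    *counter = counter.wrapping_add(1);
}

/// Atomic increment (hypothetical helper).
/// Assumption: on x86-64 this usually becomes a LOCK-prefixed add,
/// even with Relaxed ordering, because the read-modify-write itself
/// must be atomic.
pub fn bump_atomic(counter: &AtomicU64) {
    counter.fetch_add(1, Ordering::Relaxed);
}
```

Both do the same arithmetic; the difference in the latency figures above comes from the LOCK prefix, not from the addition.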
Tuna-Fish 5 days ago
On modern CPUs atomic adds are reasonably fast, but only when they are uncontended. If the cache line holding the value has to bounce between CPUs, that usually costs an extra 100 ns or so (nanoseconds, not cycles). Writing performant parallel code always means absolutely minimizing communication between threads.
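A rough Rust sketch of what "minimizing communication" can look like for a counter. `count_shared` and `count_local` are hypothetical names, and this is one possible pattern rather than the only one: the first version has every thread hit the same atomic on every iteration, so its cache line is contended; the second keeps a thread-local tally and touches the shared atomic once per thread.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Every increment goes to one shared atomic, so the cache line
/// holding `total` is fought over by all threads.
pub fn count_shared(threads: usize, iters: u64) -> u64 {
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..iters {
                    total.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    total.load(Ordering::Relaxed)
}

/// Each thread counts privately and publishes its tally once,
/// so there is one shared atomic operation per thread in total.
pub fn count_local(threads: usize, iters: u64) -> u64 {
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                let mut local = 0u64;
                for _ in 0..iters {
                    local += 1;
                }
                total.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    total.load(Ordering::Relaxed)
}
```

Both return `threads * iters`. To see the contention cost on your own machine, time the two calls with `std::time::Instant`. Keep in mind the compiler may collapse the local loop entirely in an optimized build, so treat any measurement as illustrative rather than a careful benchmark.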