Tuna-Fish 5 days ago

On modern CPUs, atomic adds are now reasonably fast, but only when they are uncontended. If the cache line the value is on has to bounce between CPUs, that usually adds roughly 100 ns (nanoseconds, not cycles).

Writing performant parallel code always means absolutely minimizing communication between threads.
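
To make that concrete, here is a minimal Rust sketch of the two patterns. The function names `count_shared` and `count_local` are made up for illustration and are not from any library. The first has every thread hit the same atomic on each iteration, so the cache line keeps moving between cores. The second keeps the count thread-local and touches the shared atomic once per thread. Both return the same total; no timings are claimed here, since they depend heavily on the hardware and on what the optimizer does with the local loop.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Contended version: every increment is an atomic read-modify-write on
/// the same memory location, shared by all threads.
pub fn count_shared(threads: u64, iters: u64) -> u64 {
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..iters {
                    total.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // All scoped threads have been joined by the time scope() returns.
    total.load(Ordering::Relaxed)
}

/// Low-communication version: each thread counts in a plain local variable
/// and publishes its result with a single atomic add at the end.
pub fn count_local(threads: u64, iters: u64) -> u64 {
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                let mut local = 0u64;
                for _ in 0..iters {
                    local += 1; // ordinary add, no cross-thread traffic
                }
                total.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    total.load(Ordering::Relaxed)
}
```

Calling `count_shared(4, 10_000)` and `count_local(4, 10_000)` should both give 40000; the difference is only in how often the threads have to communicate through the shared cache line.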

loeg 5 days ago

Sure, but even the uncontended case is ~10x slower than regular ADD.