Tuna-Fish 5 days ago
On modern CPUs, atomic adds are reasonably fast, but only when they are uncontended. If the cache line holding the value has to bounce between CPUs, that usually costs an extra 100 ns or so (nanoseconds, not cycles). Writing performant parallel code always means absolutely minimizing communication between threads.
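To make that concrete, here is a rough C++ sketch of the two extremes. The function names `count_shared` and `count_local` are made up for illustration, and the actual timings depend heavily on the machine, so none are quoted. The first version has every thread hit the same atomic on every iteration, so the cache line bounces between cores; the second keeps a thread-local sum using plain adds and touches the shared atomic once per thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Contended version: every increment is an atomic read-modify-write
// on a single shared cache line.
std::uint64_t count_shared(unsigned num_threads, std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total, iters_per_thread] {
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Low-communication version: each thread accumulates into a local
// variable with ordinary adds and publishes its sum exactly once.
std::uint64_t count_local(unsigned num_threads, std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&total, iters_per_thread] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                local += 1;  // plain ADD, no cross-thread traffic
            }
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```

Both return the same total, e.g. `count_shared(4, 1000)` and `count_local(4, 1000)` both give 4000. An optimizing compiler will likely collapse the local loop entirely, which is kind of the point: work that stays thread-local is work the compiler and CPU can make cheap.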
loeg 5 days ago
Sure, but even the uncontended case is ~10x slower than a regular ADD.