whizzter 3 hours ago
I'd be super interested in how this compares across CPU architectures. Is there an optimization in Apple silicon that makes this bad, while it would fly on Intel/AMD CPUs?
hansvm 3 hours ago
I've observed the same behavior on AMD and Intel at $WORK. Our solution (ideal for us, with reads happening roughly 1B times more often than writes) was to pessimize writes in favour of reads and add some per-thread state to prevent cache line sharing. We also tossed in an A/B system, so reads aren't delayed even while writes are happening; they just get stale data (also fine for our purposes).
| ||||||||
gpderetta 2 hours ago
The behaviour is quite typical for any MESI-style cache coherence system (i.e. most if not all of them). A specific microarchitecture might alleviate it a bit with lower-latency cross-core communication, but the solution (using a single naive RW lock to protect the cache) is inherently non-scalable.
PunchyHamster 2 hours ago
A read lock requires communication between cores. It just can't scale with CPU count.