whizzter 3 hours ago
I'd be super interested in how this compares across CPU architectures. Is there an optimization in Apple silicon that makes this bad, while it would fly on Intel/AMD CPUs?
hansvm 3 hours ago
I've observed the same behavior on AMD and Intel at $WORK. Our solution (ideal for us, with reads happening roughly 1B times more often than writes) was to pessimize writes in favour of reads and add some per-thread state to prevent cache line sharing. We also tossed in an A/B system, so reads aren't delayed even while writes are happening; they just get stale data (also fine for our purposes).
| ||||||||
gpderetta 2 hours ago
The behaviour is quite typical for any MESI-style cache coherence system (i.e. most if not all of them). A specific microarchitecture might alleviate it a bit with lower-latency cross-core communication, but the solution (using a single naive RW lock to protect the cache) is inherently non-scalable.
PunchyHamster 2 hours ago
A read lock requires communication between cores. It just can't scale with CPU count.