menaerus 15 hours ago

I can't wrap my head around how triggering the L1 HW prefetcher, so that it loads two pairs of cache lines from L2 into L1, can cause false sharing.

Perhaps what the FB experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of a 128-byte data layout:

    Streamer — Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line. To software, the L2 streamer’s functionality is similar to the adjacent cache line prefetch mechanism found in processors based on Intel NetBurst microarchitecture.

loeg 4 hours ago

Not sure where you're getting L1 from. The folly comment doesn't mention it.

FWIW, IA-32 SDM Vol 3A, section 10.10.6.7 still explicitly recommends 128-byte alignment for locks on NetBurst uarch (~Pentium 4) processors.

> 10.10.6.7 Place Locks and Semaphores in Aligned, 128-Byte Blocks of Memory

> When software uses locks or semaphores to synchronize processes, threads, or other code sections; Intel recommends that only one lock or semaphore be present within a cache line (or 128 byte sector, if 128-byte sector is supported). In processors based on Intel NetBurst microarchitecture (which support 128-byte sector consisting of two cache lines), following this recommendation means that each lock or semaphore should be contained in a 128-byte block of memory that begins on a 128-byte boundary. The practice minimizes the bus traffic required to service locks.

There is also an interesting table of cache line and "sector" sizes in Table 13-1 (it's all 64-byte cache lines in newer Intel CPUs).
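
To make the quoted recommendation concrete, here is a minimal sketch of a lock padded out to its own 128-byte block. The names `PaddedLock` and `kSectorSize` are made up for illustration, and the 128-byte figure is an assumption taken from the NetBurst "sector" size above, not something the compiler knows about the target CPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumption: 128 bytes = two 64-byte cache lines (the NetBurst "sector").
constexpr std::size_t kSectorSize = 128;

// Hypothetical spinlock that occupies a whole 128-byte aligned block,
// so no other lock or hot data can share its pair of cache lines.
struct alignas(kSectorSize) PaddedLock {
    std::atomic<bool> locked{false};

    // Returns true if the lock was acquired.
    bool try_lock() {
        return !locked.exchange(true, std::memory_order_acquire);
    }

    // Spin until the lock is acquired.
    void lock() {
        while (!try_lock()) {
        }
    }

    void unlock() {
        assert(locked.load(std::memory_order_relaxed)); // must be held
        locked.store(false, std::memory_order_release);
    }
};

// alignas rounds the size up to a multiple of the alignment, so
// adjacent array elements land in different 128-byte blocks.
static_assert(sizeof(PaddedLock) == kSectorSize, "one lock per block");
static_assert(alignof(PaddedLock) == kSectorSize, "block-aligned");
```

The cost is 128 bytes per lock, which is why this kind of padding is usually reserved for a small number of heavily contended objects.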

stinkbeetle 14 hours ago

> I can't wrap my head around how triggering the L1 HW prefetcher, so that it loads two pairs of cache lines from L2 into L1, can cause false sharing.

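CPU0 stores to byte 0x10 and dirties CL0 (0x00-0x3F). CPU1 loads byte 0x50, which belongs to a different data structure and sits in CL1 (0x40-0x7F), and its adjacent-line prefetcher also loads CL0, which is what Pentium 4 did. That read forces CPU0 to give up exclusive ownership of the dirty CL0, so its next store has to reacquire the line, even though the two CPUs never touched the same 64-byte line.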
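
The address arithmetic in that example can be checked with a small sketch. The helper names (`line_of`, `sector_of`, `pair_line_base`) are hypothetical, and the code assumes 64-byte lines paired into 128-byte aligned blocks; it only models which addresses fall together, not what any real prefetcher does.

```cpp
#include <cstdint>

constexpr std::uint64_t kLineSize = 64;    // cache line size in bytes
constexpr std::uint64_t kSectorSize = 128; // assumed pair-of-lines granule

// Index of the 64-byte cache line that holds addr.
constexpr std::uint64_t line_of(std::uint64_t addr) {
    return addr / kLineSize;
}

// Index of the 128-byte aligned block that holds addr.
constexpr std::uint64_t sector_of(std::uint64_t addr) {
    return addr / kSectorSize;
}

// Base address of the other line in the same 128-byte block:
// round down to the line base, then flip the 64-byte bit.
constexpr std::uint64_t pair_line_base(std::uint64_t addr) {
    return (addr & ~(kLineSize - 1)) ^ kLineSize;
}

// 0x10 and 0x50 are in different lines (CL0 and CL1)...
static_assert(line_of(0x10) == 0, "0x10 is in CL0");
static_assert(line_of(0x50) == 1, "0x50 is in CL1");
// ...but in the same 128-byte block, so an adjacent-line prefetch
// triggered by a load of 0x50 would also touch CL0.
static_assert(sector_of(0x10) == sector_of(0x50), "same 128-byte block");
static_assert(pair_line_base(0x50) == 0x00, "pair of CL1 is CL0");
static_assert(pair_line_base(0x10) == 0x40, "pair of CL0 is CL1");
```

Two objects avoid this interaction when `sector_of` differs for every byte they occupy, which is what 128-byte alignment plus padding guarantees.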

> Perhaps what the FB experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of a 128-byte data layout:

Seems plausible.