▲ | menaerus 15 hours ago | |
I can't wrap my head around how triggering the L1 HW prefetcher, so that it loads two pairs of cache lines from L2 into L1, can cause false sharing. Perhaps what the fb experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of the 128-byte data layout:
▲ | loeg 4 hours ago | parent | next [-] | |
Not sure where you're getting L1 from. The folly comment doesn't mention it. FWIW, IA-32 SDM Vol 3A, section 10.10.6.7 still explicitly recommends 128-byte alignment for locks on NetBurst uarch (~Pentium 4) processors.

> 10.10.6.7 Place Locks and Semaphores in Aligned, 128-Byte Blocks of Memory
>
> When software uses locks or semaphores to synchronize processes, threads, or other code sections; Intel recommends that only one lock or semaphore be present within a cache line (or 128 byte sector, if 128-byte sector is supported). In processors based on Intel NetBurst microarchitecture (which support 128-byte sector consisting of two cache lines), following this recommendation means that each lock or semaphore should be contained in a 128-byte block of memory that begins on a 128-byte boundary. The practice minimizes the bus traffic required to service locks.

There is also an interesting table of cache line and "sector" sizes in Table 13-1 (it's all 64-byte cache lines in newer Intel CPUs).
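To make the SDM recommendation concrete, here is a minimal C++ sketch of a lock padded out to its own 128-byte block. The names `kLockAlign`, `PaddedLock`, and `block_of` are made up for illustration and are not taken from folly or the SDM. The 128 is simply the NetBurst sector size quoted above.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 128 bytes, i.e. one NetBurst "sector" made of two 64-byte lines.
constexpr std::size_t kLockAlign = 128;

// Hypothetical lock type. alignas forces every instance to start on a
// 128-byte boundary and pads sizeof up to 128, so no two locks share a block.
struct alignas(kLockAlign) PaddedLock {
    std::atomic<bool> locked{false};
};

static_assert(alignof(PaddedLock) == kLockAlign, "lock starts on a 128-byte boundary");
static_assert(sizeof(PaddedLock) == kLockAlign, "lock occupies the whole 128-byte block");

// Index of the 128-byte block holding element i of a PaddedLock array
// (relative to the array base, which is itself 128-byte aligned).
constexpr std::size_t block_of(std::size_t i) {
    return (i * sizeof(PaddedLock)) / kLockAlign;
}
```

With this layout, `locks[0]` and `locks[1]` of a `PaddedLock locks[2]` array land in different 128-byte blocks, at the cost of 127 bytes of padding per lock.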
▲ | stinkbeetle 14 hours ago | parent | prev [-] | |
> I can't wrap my head around how triggering the L1 HW prefetcher, so that it loads two pairs of cache lines from L2 into L1, can cause false sharing.

CPU0 stores to byte 0x10 and dirties CL0 (0x00-0x3F). CPU1 loads byte 0x50, in a different data structure that lives in CL1, and its adjacent-line prefetcher also loads CL0, which is what Pentium 4 did.

> Perhaps what the fb experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of the 128-byte data layout:

Seems plausible.
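The address arithmetic in that scenario can be checked with a short C++ sketch. It assumes 64-byte lines paired into 128-byte aligned sectors, and it models the adjacent-line prefetcher as fetching the other line of the same pair. The helper names (`line_of`, `sector_of`, `buddy_line`) are hypothetical.

```cpp
#include <cstdint>

// Assumptions: 64-byte cache lines, paired into 128-byte aligned sectors.
constexpr std::uint64_t kLineSize = 64;
constexpr std::uint64_t kSectorSize = 128;

// Cache line index that a byte address falls into.
constexpr std::uint64_t line_of(std::uint64_t addr) { return addr / kLineSize; }

// 128-byte sector index that a byte address falls into.
constexpr std::uint64_t sector_of(std::uint64_t addr) { return addr / kSectorSize; }

// The other line of the same 128-byte pair, i.e. what an adjacent-line
// prefetcher would pull in alongside the demanded line (simplified model).
constexpr std::uint64_t buddy_line(std::uint64_t addr) { return line_of(addr) ^ 1; }

static_assert(line_of(0x10) == 0, "CPU0's store hits CL0");
static_assert(line_of(0x50) == 1, "CPU1's load hits CL1");
static_assert(sector_of(0x10) == sector_of(0x50), "both lines share one 128-byte sector");
static_assert(buddy_line(0x50) == line_of(0x10), "loading 0x50 also drags in CL0");
```

The two accesses never touch the same 64-byte line, yet under this model CPU1's load still requests the line CPU0 has dirtied, which is the false sharing being described.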