stinkbeetle 2 days ago

The cache must be physically organized as 64-byte lines. Cache line size matters to software mainly for two things:

- Architectural interfaces like DC CVAU (I think; I don't really know aarch64). These don't necessarily have to reflect the physical cache organization: cleaning a "line" could clean two physical lines. (See the sketch after this list.)

- Performance. The only thing you really care about is behavior on stores and on load misses, for avoiding false-sharing / cache-line-bouncing problems.
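For the first point, here's a minimal sketch (my assumptions: an aarch64 toolchain, and EL0 reads of CTR_EL0 not trapped, which Linux and macOS normally permit) of reading the architectural line sizes that the cache maintenance instructions operate on:

    // Read CTR_EL0 and decode the architectural cache maintenance granules.
    // DminLine/IminLine are log2 of the smallest line size in 4-byte words,
    // so the byte size is 4 << field. This is the stride software uses for
    // DC CVAU / IC IVAU loops; it need not match the physical line size.
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint64_t ctr;
        asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
        unsigned dminline = (ctr >> 16) & 0xf; // bits [19:16]
        unsigned iminline = ctr & 0xf;         // bits [3:0]
        std::printf("DC granule: %u bytes, IC granule: %u bytes\n",
                    4u << dminline, 4u << iminline);
        return 0;
    }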

It's possible that they think 128-byte lines will help performance and hope to switch over once legacy software goes away, seeding their Mac ecosystem with 128-byte lines now. Or that 128-byte-line behavior actually does offer some performance benefit today, and they have a mode that basically gangs two lines together (Pentium 4 had something similar, IIRC) so it has the performance characteristics of a 128-byte line.

loeg a day ago | parent

Early x86 prefetchers would fetch two adjacent cache lines, so despite the 64-byte physical size, adjacent lines would in practice cause false sharing. This is mostly historical, though it's still relatively common to use a 128-byte line size on x86. E.g., https://github.com/facebook/folly/blob/main/folly/lang/Align... (Sandy Bridge was a 2011 CPU). (Clang's implementation of the std version of these constants uses 64 for both on x86: https://godbolt.org/z/r1fdYTWEn .)
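In case it helps, the std version of those constants looks like this in use; a minimal sketch, assuming a toolchain that ships std::hardware_destructive_interference_size (C++17, in <new>):

    // Pad each counter out to the "destructive interference" size so two
    // threads writing neighbouring slots don't contend on the same line
    // (or the same prefetched pair of lines, where the constant is 128).
    #include <atomic>
    #include <cstdio>
    #include <new>

    struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
        std::atomic<long> value{0};
    };

    int main() {
        static PaddedCounter counters[2]; // each lands in its own block
        std::printf("interference size: %zu, sizeof(PaddedCounter): %zu\n",
                    std::hardware_destructive_interference_size,
                    sizeof(counters[0]));
        return 0;
    }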

menaerus 15 hours ago | parent

I can't wrap my head around how triggering the L1 HW prefetcher, so that it loads a pair of cache lines from L2 into L1, can cause false sharing.

Perhaps what the fb experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of a 128-byte data layout:

    Streamer — Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line. To software, the L2 streamer’s functionality is similar to the adjacent cache line prefetch mechanism found in processors based on Intel NetBurst microarchitecture.
loeg 4 hours ago | parent | next

Not sure where you're getting L1 from. The folly comment doesn't mention it.

FWIW, IA-32 SDM Vol 3A, section 10.10.6.7 still explicitly recommends 128-byte alignment for locks on NetBurst uarch (~Pentium 4) processors.

> 10.10.6.7 Place Locks and Semaphores in Aligned, 128-Byte Blocks of Memory

> When software uses locks or semaphores to synchronize processes, threads, or other code sections; Intel recommends that only one lock or semaphore be present within a cache line (or 128 byte sector, if 128-byte sector is supported). In processors based on Intel NetBurst microarchitecture (which support 128-byte sector consisting of two cache lines), following this recommendation means that each lock or semaphore should be contained in a 128-byte block of memory that begins on a 128-byte boundary. The practice minimizes the bus traffic required to service locks.

There is also an interesting table of cache line and "sector" sizes in Table 13-1 (it's all 64-byte cache lines in newer Intel CPUs).
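In code, the recommendation amounts to something like this (a sketch; the literal 128 is the NetBurst sector size from the SDM, not a constant from any header):

    // One lock per 128-byte block, starting on a 128-byte boundary, so a
    // lock never shares a two-line sector with unrelated data. alignas on
    // the struct also pads sizeof up to the alignment.
    #include <atomic>

    struct alignas(128) SectorLock {
        std::atomic<int> locked{0};
    };

    static_assert(sizeof(SectorLock) == 128, "one lock per sector");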

stinkbeetle 14 hours ago | parent | prev

> I can't wrap my head around how triggering the L1 HW prefetcher, so that it loads a pair of cache lines from L2 into L1, can cause false sharing.

CPU0 stores to byte 0x10 and dirties CL0 (0x00-0x3f). CPU1 loads byte 0x50, which is in a different data structure in CL1, and its adjacent-line prefetcher also pulls in CL0; that's what Pentium 4 did.
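The bouncing pattern looks something like this (a toy sketch; the names and iteration count are made up). The adjacent case puts both counters in one line; padding them 128 bytes apart also defeats pair-line prefetch effects:

    // Two threads hammer counters that either sit in the same line or are
    // padded a full 128-byte "sector" apart.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct Adjacent {      // both counters in one 64-byte line
        std::atomic<long> a{0}, b{0};
    };
    struct Padded {        // 128 bytes apart: no line (or pair-line) sharing
        alignas(128) std::atomic<long> a{0};
        alignas(128) std::atomic<long> b{0};
    };

    template <class T>
    static double run() {
        T counters;
        auto start = std::chrono::steady_clock::now();
        std::thread t1([&] { for (long i = 0; i < 10'000'000; ++i) counters.a.fetch_add(1); });
        std::thread t2([&] { for (long i = 0; i < 10'000'000; ++i) counters.b.fetch_add(1); });
        t1.join();
        t2.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        std::printf("adjacent: %.3fs, padded: %.3fs\n", run<Adjacent>(), run<Padded>());
        return 0;
    }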

> Perhaps what the fb experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of a 128-byte data layout:

Seems plausible.