menaerus 2 days ago

Plenty of interesting details.

> Regarding cache-line size, sysctl on macOS reports a value of 128 B, while getconf and the CTR_EL0 register on Asahi Linux returns 64 B, which is also supported by our measurements.

How would this even be possible?
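For reference, CTR_EL0 can be read from userspace (Linux traps and emulates the access where needed). A minimal AArch64 sketch, assuming GCC/Clang inline asm:

    // Read CTR_EL0 and decode the minimum D-cache line size (AArch64 only).
    // DminLine (bits [19:16]) is log2 of the line size in 4-byte words.
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint64_t ctr;
        asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
        unsigned dminline = (ctr >> 16) & 0xF;
        std::printf("CTR_EL0 DminLine: %u bytes\n", 4u << dminline);
    }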

stinkbeetle 2 days ago | parent | next [-]

The cache must be physically organized as 64-byte lines, then. Cache line size matters to software mainly for two things:

- Architectural interfaces like (I think, I don't really know aarch64) DC CVAU. These don't necessarily have to reflect physical cache organization; cleaning a "line" could clean two physical lines.

- Performance. The only thing you really care about here is behavior on stores and on load misses, to avoid false-sharing / cache-line-bouncing problems (see the sketch at the end of this comment).

It's possible that either they think 128-byte lines will help performance and hope to switch over once legacy software goes away, seeding their Mac ecosystem with 128-byte lines now; or that 128-byte-line behavior actually does offer some performance benefit and they have a mode that basically gangs two lines together (Pentium 4 had something similar, IIRC), so it has the performance characteristics of a 128-byte line.
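A minimal sketch of that false-sharing concern (illustrative only; 128-byte padding is the conservative choice given the pairing behavior above):

    // Two threads increment separate counters. When the counters share a
    // cache line they bounce it between cores; alignas(128) keeps them on
    // separate lines (and separate 128-byte "sectors", in case the hardware
    // gangs lines together as described above).
    #include <atomic>
    #include <thread>

    struct Shared {
        std::atomic<long> a{0};               // same line as b -> false sharing
        std::atomic<long> b{0};
    };
    struct Padded {
        alignas(128) std::atomic<long> a{0};  // each on its own 128-byte block
        alignas(128) std::atomic<long> b{0};
    };

    template <class T>
    void hammer(T& s) {
        std::thread t1([&] { for (int i = 0; i < 100'000'000; ++i) s.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (int i = 0; i < 100'000'000; ++i) s.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join(); t2.join();
    }

    int main() {
        Shared s; Padded p;
        hammer(s);  // time this...
        hammer(p);  // ...against this; the padded version is typically much faster
    }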

loeg a day ago | parent [-]

Early x86 prefetchers would fetch two adjacent cache lines, so despite the 64-byte physical line size, adjacent lines would cause false sharing in practice. This is mostly historical, though it's still relatively common to use a 128-byte line size on x86. E.g., https://github.com/facebook/folly/blob/main/folly/lang/Align... (Sandy Bridge was a 2011 CPU). (Clang's impl of the std version of these constants uses 64 for both on x86: https://godbolt.org/z/r1fdYTWEn .)
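For the curious, the std counterparts are just constants in <new>; a quick sketch to see what your toolchain reports (per the godbolt link, Clang on x86 prints 64 for both):

    // Print the C++17 interference-size constants mentioned above.
    #include <new>
    #include <cstdio>

    int main() {
        std::printf("destructive:  %zu\n", std::hardware_destructive_interference_size);
        std::printf("constructive: %zu\n", std::hardware_constructive_interference_size);
    }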

menaerus 15 hours ago | parent [-]

I can't wrap my head around how triggering the L1 HW prefetcher, so that it loads a pair of cache lines from L2 into L1, can cause false sharing.

Perhaps what the fb experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of a 128-byte data layout:

    Streamer — Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line. To software, the L2 streamer’s functionality is similar to the adjacent cache line prefetch mechanism found in processors based on Intel NetBurst microarchitecture.
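If so, the defensive layout is the one the SDM text implies: a minimal sketch (names are mine) that gives each writer its own 128-byte aligned block, so the streamer's pair-line fetch never drags in another thread's data:

    // Lay out per-thread hot data in separate 128-byte aligned blocks, so
    // the L2 streamer's pair-line prefetch stays within one thread's data.
    #include <atomic>
    #include <cstddef>

    constexpr std::size_t kSector = 128;  // two 64-byte lines, per the SDM text above

    struct alignas(kSector) PerThread {
        std::atomic<long> counter{0};
        char pad[kSector - sizeof(std::atomic<long>)];  // fill out the sector
    };
    static_assert(sizeof(PerThread) == kSector);

    PerThread slots[8];  // slots[i] is only written by thread i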
loeg 4 hours ago | parent | next [-]

Not sure where you're getting L1 from. The folly comment doesn't mention it.

FWIW, IA-32 SDM Vol 3A, section 10.10.6.7 still explicitly recommends 128-byte alignment for locks on NetBurst uarch (~Pentium 4) processors.

> 10.10.6.7 Place Locks and Semaphores in Aligned, 128-Byte Blocks of Memory

> When software uses locks or semaphores to synchronize processes, threads, or other code sections; Intel recommends that only one lock or semaphore be present within a cache line (or 128 byte sector, if 128-byte sector is supported). In processors based on Intel NetBurst microarchitecture (which support 128-byte sector consisting of two cache lines), following this recommendation means that each lock or semaphore should be contained in a 128-byte block of memory that begins on a 128-byte boundary. The practice minimizes the bus traffic required to service locks.

There is also an interesting table of cache line and "sector" sizes in Table 13-1 (it's all 64-byte cache lines in newer Intel CPUs).
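Followed mechanically, the recommendation comes out to something like this: a minimal sketch with a toy spinlock, one lock per 128-byte aligned block:

    // One lock per 128-byte aligned block, per the SDM recommendation above.
    #include <atomic>

    struct alignas(128) SectorLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;

        void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
        void unlock() { flag.clear(std::memory_order_release); }
    };
    static_assert(alignof(SectorLock) == 128 && sizeof(SectorLock) == 128);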

stinkbeetle 14 hours ago | parent | prev [-]

> I can't wrap my head around how triggering the L1 HW prefetcher, so that it loads a pair of cache lines from L2 into L1, can cause false sharing.

CPU0 stores to byte 0x10 and dirties CL0 (0x00-0x3F). CPU1 loads byte 0x50, in a different data structure that sits in CL1, and its adjacent-line prefetcher also loads CL0, which is what Pentium 4 did.
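In code, that scenario is roughly the following (offsets as above; whether it actually bounces depends on the adjacent-line prefetcher):

    // Writer dirties CL0; reader only touches CL1, yet an adjacent-line
    // prefetcher on the reader's side can still pull in CL0.
    #include <atomic>
    #include <thread>

    alignas(128) std::atomic<char> buf[128];  // CL0 = buf[0x00..0x3F], CL1 = buf[0x40..0x7F]

    int main() {
        std::thread writer([] {
            for (int i = 0; i < 50'000'000; ++i)
                buf[0x10].store(1, std::memory_order_relaxed);   // CPU0 dirties CL0
        });
        std::thread reader([] {
            for (int i = 0; i < 50'000'000; ++i)
                (void)buf[0x50].load(std::memory_order_relaxed); // CPU1 reads CL1
        });
        writer.join(); reader.join();
    }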

> Perhaps what the fb experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of a 128-byte data layout:

Seems plausible.

loeg a day ago | parent | prev [-]

Are you asking how it is possible for a sysctl to report the wrong value? It's trivial for the OS to return whatever it likes from a sysctl; sysctls are just software. (The sysctl is wrong.)
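Concretely, the number under discussion comes from hw.cachelinesize, and the kernel can put anything there. A minimal sketch of the query on macOS:

    // Query the macOS sysctl in question; the kernel can report whatever it
    // likes here, independent of the physical cache organization.
    #include <sys/sysctl.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::int64_t linesize = 0;
        std::size_t len = sizeof(linesize);
        if (sysctlbyname("hw.cachelinesize", &linesize, &len, nullptr, 0) == 0)
            std::printf("hw.cachelinesize: %lld\n", (long long)linesize);
    }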

menaerus 16 hours ago | parent [-]

No, that's not what I was wondering. Cache-line size, being a HW property, is not exactly "configurable", although I guess technically it could be. I was confused how the Apple sysctl can return 128B, which should be ground truth, while the paper says its measurements support the 64B cache-line size reported by Asahi Linux.

I think the measurements are hard evidence, and if they're correct, why would the Apple sysctl return 128B then? I'm actually wondering if the Apple M design really supports two different cache-line sizes, 64B and 128B, with the mode somehow being configurable.
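One way to settle it empirically (roughly what such measurements do) is to time two threads writing to locations N bytes apart for varying N; the run gets dramatically cheaper once N crosses the coherence granularity. A minimal sketch, with arbitrary iteration counts and strides:

    // Probe the coherence granularity: time two threads storing `stride`
    // bytes apart. The run should get much faster once `stride` exceeds
    // the line (or sector) size -- 64 vs 128 is exactly the question here.
    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <thread>

    alignas(256) std::atomic<char> buf[512];

    double run(std::size_t stride) {
        auto t0 = std::chrono::steady_clock::now();
        std::thread a([] { for (int i = 0; i < 20'000'000; ++i) buf[0].store(1, std::memory_order_relaxed); });
        std::thread b([stride] { for (int i = 0; i < 20'000'000; ++i) buf[stride].store(1, std::memory_order_relaxed); });
        a.join(); b.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        for (std::size_t s : {8, 32, 64, 128, 256})
            std::printf("stride %3zu: %.3fs\n", s, run(s));
    }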