Sharlin 10 hours ago

As the author says, a couple of extra bytes still matter, perhaps more than 20ish years ago. There are vast amounts of RAM, sure, but it's glacially slow, and there are only a few tens of kB of L1 instruction cache.

Never mind the fact that, as the author also mentions, the xor idiom takes essentially zero cycles to execute because nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.
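The size argument above rests on the standard x86-64 encodings, which are worth seeing concretely. A small sketch (the byte sequences are the well-known encodings, hardcoded here rather than assembled):

```python
# The classic x86-64 zeroing idiom vs. the literal move, compared by encoding size.
xor_eax_eax = bytes([0x31, 0xC0])            # xor eax, eax -> 2 bytes
mov_eax_0   = bytes([0xB8, 0, 0, 0, 0])      # mov eax, 0   -> 5 bytes (imm32)

print(len(xor_eax_eax))  # 2
print(len(mov_eax_0))    # 5
```

So the idiom saves 3 bytes per zeroing, on top of being recognized by the renamer as a dependency-breaking zero idiom.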

cogman10 10 hours ago | parent | next [-]

L1 instruction cache is backed by L2 and L3 caches.

For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core) and 64MB of L3 (shared, 128 if you have the X3D version).

I won't say it doesn't matter, but it doesn't matter as much as it once did. CPU caches have gotten huge while the instructions remain the same size.

The more important part, at this point, is that it's idiomatic. That means hardware designers are much more likely to put in specialty logic to make sure it's fast. It's a common enough operation to deserve its own special case. You can fit a lot of 8 byte instructions into 1280kb of memory. And as it turns out, it's pretty common for applications to spend a lot of their time in small chunks of instructions. The slow part of a lot of code will be that `for` loop with the 30 AVX instructions doing magic. That's why you'll often see compilers burn `NOP` instructions to align a loop: to avoid splitting it across a cache line.
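The NOP-padding trick is just address arithmetic. A minimal sketch (the addresses are made up for illustration; compilers typically align hot loop heads to 16 or 32 bytes):

```python
def nop_padding(addr: int, align: int = 16) -> int:
    """Bytes of NOP padding needed so the next instruction
    (e.g. a hot loop head) starts on an `align`-byte boundary."""
    return (-addr) % align

# A loop head that would land at 0x401003 needs 13 NOP bytes
# to start cleanly at 0x401010 instead.
print(nop_padding(0x401003))  # 13
print(nop_padding(0x401010))  # 0 (already aligned)
```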

Sharlin 9 hours ago | parent | next [-]

> For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core)

Ryzen 9 CPUs have 1280kB of L1 in total. 80kB (48+32) per core, and the 9 series is the first in the entire history of Ryzens to have some other number than 64 (32+32) kilobytes of L1 per core. The 16MB L2 figure is also total. 1MB per core, same as the 7 series. AMD obviously touts the total, not per-core, amounts in their marketing materials because it looks more impressive.

monocasa 6 hours ago | parent | next [-]

Yeah, the reason for that is that it's expensive in PPA for the size of an L1 cache to exceed the number of ways times the page size. The jump to 48kB was also a jump to 12-way set associativity.

As an aside, zen 1 did actually have a 64kB (and only 4 way!) L1I cache, but changed to the page size times way count restriction with zen 2, reducing the L1 size by half.

You can also see this on the Apple side, where their giant 192kB L1I caches are 12-way with a 16kB page size.
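The ways-times-page-size bound (which lets a virtually-indexed, physically-tagged cache avoid aliasing) checks out arithmetically against both figures above. A quick sketch:

```python
def max_vipt_l1_size(ways: int, page_size: int) -> int:
    # A VIPT cache avoids index aliasing when each way fits in one page,
    # so the whole cache is capped at ways * page_size.
    return ways * page_size

# Zen's 48kB L1: 12 ways * 4 KiB x86 pages
print(max_vipt_l1_size(12, 4 * 1024) // 1024)   # 48
# Apple's 192kB L1I: 12 ways * 16 KiB pages
print(max_vipt_l1_size(12, 16 * 1024) // 1024)  # 192
```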

kbolino 9 hours ago | parent | prev [-]

Also, rather importantly, the L1i (instruction) cache is still only 32 kB. The part that got bigger, the 48 kB of L1d (data) cache, does not count for this purpose.

gpderetta 8 hours ago | parent | prev [-]

Instruction caches also prefetch very well, as long as branch prediction is good. Of course on a misprediction you might also suffer a cache miss in addition to the normal penalty.

umanwizard 7 hours ago | parent | prev [-]

> nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

This is slightly inaccurate -- instructions retire in order, so it doesn't necessarily retire immediately after it's decoded and the new zeroed register is assigned. It has to sit in the reorder buffer waiting until all the instructions ahead of it are retired as well.

Thus in workloads where reorder buffer size is a bottleneck, it could contribute to that. However I doubt this describes most workloads.
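The in-order-retirement point can be illustrated with a toy reorder buffer (a deliberately simplified sketch, not a model of any real core):

```python
from collections import deque

# Toy ROB: each entry is (name, executed?). Entries retire strictly in
# program order, so an already-executed zeroing uop still occupies a
# slot behind a slower instruction ahead of it.
rob = deque([("slow_load", False), ("xor_zero", True), ("add", True)])

retired = []
while rob and rob[0][1]:          # retire only while the head is done
    retired.append(rob.popleft()[0])

print(retired)   # [] -- the finished xor can't retire past the pending load
print(len(rob))  # 3 slots still occupied
```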

Sharlin 7 hours ago | parent [-]

Thanks, that makes sense.