I remember a lot of code zeroing registers, dating back at least to the IBM PC XT days (before the 80286).

If you decode the instruction, it makes sense to use XOR:

- mov ax, 0 - needs 4 bytes (66 b8 00 00)
- xor ax,ax - needs 3 bytes (66 31 c0)

This extra byte did matter in a machine with less than 1 megabyte of memory.

On 386 processors the same applied:

- mov eax,0 - needs 5 bytes (b8 00 00 00 00)
- xor eax,eax - needs 2 bytes (31 c0)

Here Intel made the decision to use only 2 bytes. I bet this helps both the instruction decoder and (of course) saves more memory than the old 8086 instruction.
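
To make the byte counts concrete, here is a small Python sketch that just measures the encodings quoted above. The byte strings are copied as written (including the 0x66 prefix on the 16-bit forms); the `ENCODINGS` table and `size` helper are names made up for illustration.

```python
# Encodings exactly as quoted in this comment; nothing is assembled here.
ENCODINGS = {
    "mov ax, 0":    bytes.fromhex("66 b8 00 00"),
    "xor ax, ax":   bytes.fromhex("66 31 c0"),
    "mov eax, 0":   bytes.fromhex("b8 00 00 00 00"),
    "xor eax, eax": bytes.fromhex("31 c0"),
}


def size(mnemonic: str) -> int:
    """Return the encoded length in bytes of one of the quoted instructions."""
    return len(ENCODINGS[mnemonic])


for name, code in ENCODINGS.items():
    print(f"{name:<13} {code.hex(' '):<15} {len(code)} bytes")

# Bytes saved by the xor form in each operand size.
saved_16 = size("mov ax, 0") - size("xor ax, ax")
saved_32 = size("mov eax, 0") - size("xor eax, eax")
print("saved:", saved_16, "byte (16-bit),", saved_32, "bytes (32-bit)")
```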

▲

Sharlin 10 hours ago | parent | next [-]

As the author says, a couple of extra bytes still matter, perhaps more than 20ish years ago. There are vast amounts of RAM, sure, but it's glacially slow, and there's only a few tens of kBs of L1 instruction cache.

Never mind the fact that, as the author also mentions, the xor idiom takes essentially zero cycles to execute because nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.
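
A toy model of what that could look like at the rename stage, in Python. This is only a sketch under assumptions: real renamers are far more involved, the single shared "zero" physical register is my simplification, and `Renamer` and its fields are hypothetical names. It also never frees old physical registers.

```python
class Renamer:
    """Toy register-rename stage (illustrative only)."""

    ZERO_PREG = 0  # assumption: physical register 0 always reads as zero

    def __init__(self, num_physical: int = 8):
        self.free = list(range(1, num_physical))  # p0 is never handed out
        self.table = {}    # logical register name -> physical register index
        self.issued = []   # uops that still need an execution unit

    def rename(self, op: str, dst: str, src: str) -> None:
        if op == "xor" and dst == src:
            # Zeroing idiom: repoint the logical register, execute nothing.
            self.table[dst] = self.ZERO_PREG
            return
        preg = self.free.pop(0)
        self.table[dst] = preg
        self.issued.append((op, dst, src, preg))


r = Renamer()
r.rename("mov", "eax", "ebx")  # ordinary instruction: takes p1, issues a uop
r.rename("xor", "eax", "eax")  # idiom: eax now maps to p0, no uop issued
print(r.table, r.issued)
```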

▲

cogman10 10 hours ago | parent | next [-]

L1 instruction cache is backed by L2 and L3 caches.

For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core) and 64MB of L3 (shared, 128 if you have the X3D version).

I won't say it doesn't matter, but it doesn't matter as much as it once did. CPU caches have gotten huge while the instructions remain the same size.

The more important part, at this point, is it's idiomatic. That means hardware designers are much more likely to put in specialty logic to make sure it's fast. It's a common enough operation to deserve its own special cases. You can fit a lot of 8 byte instructions into 1280kb of memory. And as it turns out, it's pretty common for applications to spend a lot of their time in small chunks of instructions. The slow part of a lot of code will be that `for loop` with the 30 AVX instructions doing magic. That's why you'll often see compilers burn `NOP` instructions to align a loop. That's to avoid splitting a cache line.
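
Rough Python sketch of the alignment arithmetic, assuming 64-byte cache lines. The loop address and size are made up, and the alignment is a parameter because compilers pick their own value.

```python
def nop_padding(address: int, alignment: int = 64) -> int:
    """Bytes of padding needed so that `address` lands on an `alignment` boundary."""
    return (-address) % alignment


def lines_touched(start: int, length: int, line_size: int = 64) -> int:
    """Number of cache lines covered by `length` bytes starting at `start`."""
    first = start // line_size
    last = (start + length - 1) // line_size
    return last - first + 1


loop_start = 0x4010F3  # hypothetical address of a loop head
loop_bytes = 40        # hypothetical size of the loop body

pad = nop_padding(loop_start)
before = lines_touched(loop_start, loop_bytes)
after = lines_touched(loop_start + pad, loop_bytes)
print(f"pad {pad} bytes: {before} cache lines -> {after}")
```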

▲

Sharlin 9 hours ago | parent | next [-]

> For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core)

Ryzen 9 CPUs have 1280kB of L1 in total. 80kB (48+32) per core, and the 9 series is the first in the entire history of Ryzens to have some other number than 64 (32+32) kilobytes of L1 per core. The 16MB L2 figure is also total. 1MB per core, same as the 7 series. AMD obviously touts the total, not per-core, amounts in their marketing materials because it looks more impressive.
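
The per-core numbers are just the totals divided by the core count. A quick Python check, assuming the 16-core 9950X:

```python
CORES = 16  # assumption: the 16-core Ryzen 9 9950X

l1_total_kb = 1280  # marketing figure, all cores combined
l2_total_mb = 16    # marketing figure, all cores combined

l1_per_core_kb = l1_total_kb // CORES
l2_per_core_mb = l2_total_mb // CORES

l1d_kb, l1i_kb = 48, 32  # the 48+32 split mentioned above

print(l1_per_core_kb, "kB L1 per core,", l2_per_core_mb, "MB L2 per core")
```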

▲

monocasa 6 hours ago | parent | next [-]

Yeah, the reason for that is that it's expensive in PPA for the size of an L1 cache to exceed the number of ways times the page size. The jump to 48kB was also a jump to 12 way set associative.

As an aside, zen 1 did actually have a 64kB (and only 4 way!) L1I cache, but changed to the page size times way count restriction with zen 2, reducing the L1 size by half.

You can also see this on the Apple side, where their giant 192kB L1I caches are 12 ways with a 16kB page size.
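
The rule of thumb is easy to check against the numbers in this comment. A Python sketch, assuming the usual 4 kB base page on x86 (`within_limit` and the labels are just illustrative names):

```python
def within_limit(size_kb: int, ways: int, page_kb: int) -> bool:
    """True if the cache size does not exceed ways * page size."""
    return size_kb <= ways * page_kb


# (label, size in kB, ways, page size in kB); x86 page size assumed to be 4 kB
caches = [
    ("48 kB, 12-way (x86)", 48, 12, 4),
    ("Zen 1 L1I, 64 kB, 4-way", 64, 4, 4),
    ("Apple L1I, 192 kB, 12-way", 192, 12, 16),
]

results = {label: within_limit(size, ways, page) for label, size, ways, page in caches}
for label, ok in results.items():
    print(f"{label:<28} {'fits' if ok else 'exceeds ways * page size'}")
```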

▲

kbolino 9 hours ago | parent | prev [-]

Also, rather importantly, the L1i (instruction) cache is still only 32 kB. The part that got bigger, the 48 kB of L1d (data) cache, does not count for this purpose.

▲

gpderetta 8 hours ago | parent | prev [-]

Instruction caches also prefetch very well, as long as branch prediction is good. Of course on a misprediction you might also suffer a cache miss in addition to the normal penalty.

▲

umanwizard 7 hours ago | parent | prev [-]

> nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

This is slightly inaccurate -- instructions retire in order, so it doesn't necessarily retire immediately after it's decoded and the new zeroed register is assigned. It has to sit in the reorder buffer waiting until all the instructions ahead of it are retired as well.

Thus in workloads where reorder buffer size is a bottleneck, it could contribute to that. However I doubt this describes most workloads.
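
A very simplified Python model of the in-order retirement point. The cycle numbers are invented, retire width is treated as unlimited, and `retire_cycles` is a hypothetical helper, not how any real core is described.

```python
def retire_cycles(program):
    """program: list of (name, cycle_when_result_is_ready) in program order.

    An instruction retires once its own result is ready AND everything
    ahead of it has retired.
    """
    retired = {}
    previous = 0
    for name, ready in program:
        previous = max(previous, ready)
        retired[name] = previous
    return retired


program = [
    ("load rbx, [mem]", 200),  # hypothetical cache-missing load
    ("xor eax, eax", 1),       # zeroing idiom, handled at rename
    ("add rcx, 1", 2),
]

result = retire_cycles(program)
for name, cycle in result.items():
    print(f"{name:<16} retires at cycle {cycle}")
```

In this model the xor is "done" at cycle 1 but still holds its reorder buffer slot until the load ahead of it retires.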

	▲	Sharlin 7 hours ago \| parent [-]
		Thanks, that makes sense.

▲

vardump 11 hours ago | parent | prev | next [-]

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

You don't need operand size prefix 0x66 when running 16 bit code in Real Mode. So "mov ax, 0" is 3 bytes and "xor ax, ax" is just 2 bytes.
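
As a quick Python illustration, dropping the leading 0x66 from the two quoted encodings gives those sizes. This only makes sense for these two byte strings; `real_mode_encoding` is a made-up helper, not a general re-encoder.

```python
OPERAND_SIZE_PREFIX = 0x66


def real_mode_encoding(code: bytes) -> bytes:
    """Drop a leading 0x66 prefix; 16-bit operands are the default in 16-bit code."""
    if code[:1] == bytes([OPERAND_SIZE_PREFIX]):
        return code[1:]
    return code


mov_ax_0 = real_mode_encoding(bytes.fromhex("66 b8 00 00"))
xor_ax_ax = real_mode_encoding(bytes.fromhex("66 31 c0"))

print("mov ax, 0 :", mov_ax_0.hex(" "), len(mov_ax_0), "bytes")
print("xor ax, ax:", xor_ax_ax.hex(" "), len(xor_ax_ax), "bytes")
```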

	▲	eb0la 11 hours ago \| parent [-]
		My fault: I just compiled the instruction with an assembler instead of looking up the actual instruction from documentation. It makes much more sense: resetting ax and bx (xor ax,ax ; xor bx,bx) will be 4 octets, DWORD aligned, and a bit faster to fetch by the x86 than the 3-octet version I wrote before.

▲

Someone 10 hours ago | parent | prev | next [-]

> If you decode the instruction, it makes sense to use XOR:

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

Except, apparently, on the Pentium Pro, according to this comment: https://randomascii.wordpress.com/2012/12/29/the-surprising-..., which says:

“But there was at least one out-of-order design that did not recognize xor reg, reg as a special case: the Pentium Pro. The Intel Optimization manuals for the Pentium Pro recommended “mov” to zero a register.”

	▲	qingcharles 22 minutes ago \| parent [-]
		That's weird, I looked it up earlier and found the P6 (Pentium Pro) was the first to actually make the xor optimization into a zero clock operation. https://fanael.github.io/archives/topic-microarchitecture-ar...

▲

RHSeeger 11 hours ago | parent | prev | next [-]

> the IBM PC XT days (before the 80286)

Fun fact - the IBM PC XT also came in a 286 model (the XT 286).

▲

eb0la 11 hours ago | parent [-]

You're right. I forgot that!

	▲	RHSeeger 24 minutes ago \| parent [-]
		To be fair, I only remember because that was the 2nd computer I owned.

▲

Anarch157a 10 hours ago | parent | prev | next [-]

I don't know enough of the 8086 so I don't know if this works the same, but on the Z80 (which means it was probably true for the 8080 too), XOR A would also clear pretty much all bits on the flag register, meaning the flags would be in a known state before doing something that could affect them.
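
For the Z80 side, a small Python sketch of the documented flag behaviour of XOR (S and Z from the result, P/V as parity, H, N and C reset). The undocumented bits 3 and 5 are ignored here, and `z80_xor` is just an illustrative name.

```python
FLAG_S, FLAG_Z, FLAG_H, FLAG_PV, FLAG_N, FLAG_C = 0x80, 0x40, 0x10, 0x04, 0x02, 0x01


def z80_xor(a: int, operand: int):
    """Return (new A, new F) for XOR, modelling documented flag bits only."""
    result = (a ^ operand) & 0xFF
    flags = 0                     # H, N and C are always reset by XOR
    if result & 0x80:
        flags |= FLAG_S           # sign: copy of bit 7
    if result == 0:
        flags |= FLAG_Z           # zero
    if bin(result).count("1") % 2 == 0:
        flags |= FLAG_PV          # parity even
    return result, flags


a, f = z80_xor(0x5A, 0x5A)        # XOR A with any starting value of A
print(f"A = {a:#04x}, F = {f:#04x}")
```

So after XOR A the carry, sign, half-carry and N flags are clear, while Z and P/V end up set.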

	▲	vanderZwan 10 hours ago \| parent [-]
		Which I guess is the same reason why modern Intel CPU pipelines can rely on it for pipelining.

▲

chasd00 7 hours ago | parent | prev [-]

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

iirc doesn't word alignment matter? I have no idea if this is how the IBM PC XT was aligned but if you had 4 byte words then it doesn't matter if you save a byte with xor because you wouldn't be able to use it for anything else anyway. again, iirc.