NewCzech 11 hours ago

The obvious answer is that XOR is faster. To do a subtract, you have to propagate the carry bit from the least-significant bit to the most-significant bit. In XOR you don't have to do that because the output of every bit is independent of the other adjacent bits.
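
To make the carry-propagation argument concrete, here's a toy bit-level model in Python (purely illustrative; real ALUs use parallel carry networks, not this naive ripple). The subtractor has to thread a borrow through every bit position in order, while XOR computes each output bit from that one position alone:

```python
# Toy bit-level model of the dependency structure. Real ALUs do not ripple
# like this, but it shows why SUB has a serial carry chain and XOR does not.

WIDTH = 8

def ripple_sub(a, b):
    """a - b, threading a borrow bit through every position in order."""
    result, borrow = 0, 0
    for i in range(WIDTH):                   # step i depends on step i-1's borrow
        ai, bi = (a >> i) & 1, (b >> i) & 1
        diff = ai ^ bi ^ borrow
        borrow = ((1 - ai) & bi) | ((1 - ai) & borrow) | (bi & borrow)
        result |= diff << i
    return result

def bitwise_xor(a, b):
    """Each output bit depends only on the two input bits at that position."""
    result = 0
    for i in range(WIDTH):                   # iterations carry no state at all
        result |= (((a >> i) ^ (b >> i)) & 1) << i
    return result

for a, b in [(200, 77), (5, 5), (0, 255)]:
    assert ripple_sub(a, b) == (a - b) % 256
    assert bitwise_xor(a, b) == (a ^ b)
```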

Probably there are ALU pipeline designs where you don't pay an explicit penalty. But not all of them, and so XOR is faster.

Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?

Someone 5 hours ago | parent | next [-]

> To do a subtract, you have to propagate the carry bit from the least-significant bit to the most-significant bit.

Yes, but that need not scale linearly with the number of bits. https://en.wikipedia.org/wiki/Carry-lookahead_adder:

“A carry-lookahead adder (CLA) or fast adder is a type of electronics adder used in digital logic. A carry-lookahead adder […] can be contrasted with the simpler, but usually slower, ripple-carry adder (RCA), for which the carry bit is calculated alongside the sum bit, and each stage must wait until the previous carry bit has been calculated to begin calculating its own sum bit and carry bit. The carry-lookahead adder calculates one or more carry bits before the sum, which reduces the wait time to calculate the result of the larger-value bits of the adder.

[…]

Already in the mid-1800s, Charles Babbage recognized the performance penalty imposed by the ripple-carry used in his difference engine, and subsequently designed mechanisms for anticipating carriage for his never-built analytical engine.[1][2] Konrad Zuse is thought to have implemented the first carry-lookahead adder in his 1930s binary mechanical computer, the Zuse Z1.”

I think most, if not all, current ALUs implement such adders.
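
To illustrate the generate/propagate trick from the quote, here's a sketch of a 4-bit carry-lookahead in Python (a toy model, not a gate-accurate design): every carry is expanded purely in terms of locally computed g/p signals and the carry-in, so no carry has to wait on a lower sum bit.

```python
# 4-bit carry-lookahead sketch: every carry is a shallow function of the
# per-bit generate/propagate signals and the carry-in, unlike a ripple
# adder where each carry waits for the previous one.

def cla_add4(a, b, cin=0):
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # generate:  a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]    # propagate: a_i XOR b_i
    c = [cin]
    c.append(g[0] | (p[0] & cin))
    c.append(g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin))
    c.append(g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin))
    s = sum((p[i] ^ c[i]) << i for i in range(4))        # sum bits from p and c
    carry_out = g[3] | (p[3] & c[3])
    return s, carry_out

assert cla_add4(0b0101, 0b0011) == (0b1000, 0)   # 5 + 3 = 8
assert cla_add4(0b1111, 0b0001) == (0b0000, 1)   # 15 + 1 wraps with carry out
```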

dreamcompiler 4 hours ago | parent [-]

Carry lookahead is definitely faster than ripple carry but it's not free. It requires high-fan-in gates that take up a fair amount of silicon. That silicon saves time though, so as you say almost nobody uses ripple carry any more.

svnt 10 hours ago | parent | prev | next [-]

His point is that on x86 there is no performance difference, but everyone except his colleague/friend uses xor, while sub actually leaves cleaner flags behind. So he suspects it's some kind of social convention selected at random and then propagated via spurious arguments in support (or that it "looks cooler" as a bit of a term of art).

It could also be that most people working in assembly are aware of the properties of logic gates, and so carry the understanding that under the hood it might somehow be better.

zahlman 5 hours ago | parent | next [-]

GP seems to think it strange that "x86" would actually not have a performance difference here.

I think this might just be due to not realizing just how far back in CPU history this goes.

wongarsu 5 hours ago | parent [-]

In a clockless CPU design you'd indeed expect xor to be faster. But in a regular CPU with a clock, you either waste a bit of xor performance by making xor and sub both take the same number of ticks, or you speed up the clock enough that the speed difference between xor and sub makes sub at least a full tick slower.

The former just seems way more practical.

dbdr 4 hours ago | parent [-]

Even if they take the same number of ticks, shouldn't xor fundamentally needing less work also mean it can be performed while drawing less power/heating less, which is just as much an improvement in the long run?

MBCook 3 hours ago | parent [-]

That wasn’t much of a concern in the 70s and 80s.

3form 10 hours ago | parent | prev [-]

I think an even more likely explanation is that x86 assembly programmers often were other-architecture assembly programmers first, or learned from people who were. Maybe there's an architecture where the idiom makes more sense and it can be attributed to that. The 6502 and 68k are the first places I would look.

richrichardsson 10 hours ago | parent | next [-]

For 68k, depending on the size you're interested in, it mostly doesn't matter.

For .b and .w, clr, eor and sub are all identical.

For .l, moveq #0 is the winner.

bonzini 5 hours ago | parent | prev [-]

6502 doesn't even have register-to-register ALU operations, there's no alternative to LDA #0.

8080/Z80 is probably where XOR A got a lead over SUB A, but they are also the same number of cycles.

Symmetry 3 hours ago | parent | prev | next [-]

There's a structure called a carry-bypass adder[1] that lets you add two numbers in O(√n) time with only O(n) gates. That or a similar structure is what modern CPUs use, and it lets you add two numbers in a single clock cycle, which is all you care about from a software perspective.

There are also tree adders which add in O(log n) time but use O(n^2) gates if you really need the speed, though AFAIK nobody actually does.

[1]https://en.wikipedia.org/wiki/Carry-skip_adder
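
A rough way to see where the O(√n) comes from (a hedged back-of-the-envelope model in Python; `skip_delay` is a made-up name and it counts "stages", not real gates): with blocks of about √n bits, the worst case ripples through one block and then skips each remaining block in one stage.

```python
import math

# Carry-skip model: n bits split into ~sqrt(n)-bit blocks. Within a block the
# carry ripples; if every bit of a block propagates, the carry skips the whole
# block in one stage. Worst case: ripple through one block, then one skip per
# block, i.e. roughly 2*sqrt(n) stages instead of n.

def skip_delay(n):
    block = max(1, math.isqrt(n))      # block size ~ sqrt(n)
    blocks = math.ceil(n / block)
    return block + blocks              # stages: one full ripple + all the skips

for n in (16, 64, 256, 1024):
    print(n, "bits:", skip_delay(n), "stages vs", n, "ripple steps")
```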

arka2147483647 10 hours ago | parent | prev | next [-]

> The answer is so obvious

A tangent, but what is Obvious depends on what you know.

Often experts don't explain the things they think are Obvious, but those things are only Obvious to them, because they are the expert.

We should all be kind, and explain even the Obvious things to those who do not know them.

akie 10 hours ago | parent [-]

"The proof is left as an exercise for the reader" comes to mind

flohofwoe 10 hours ago | parent | prev | next [-]

That comment is not very useful without pointing to real-world CPUs where SUB is more expensive than XOR ;)

E.g. on the Z80 and 6502 both have the same cycle count.

HarHarVeryFunny 6 hours ago | parent | next [-]

The 6502 doesn't support XOR A or SUB A, and in fact doesn't have a SUB opcode at all, only SBC (subtract with carry, requiring an extra instruction, SEC, to set the carry flag beforehand).

flohofwoe 5 hours ago | parent [-]

I was handwaving over the details; SBC is identical to SUB when the carry flag is set, so it's understandable why the 6502 designers didn't waste an instruction slot.

EOR and SBC still have the same cycle counts though.

HarHarVeryFunny 4 hours ago | parent [-]

Sure, in some contexts you would know that the carry flag was set or clear (depending on what you needed), and it was common to take advantage of that and not add an explicit clc or sec, although you better comment the assumption/dependency on the preceding code.

However the 6502 doesn't support reg-reg ALU operations, only reg-mem, so there simply is no xor a,a or sbc a,a support. You'd either have to do the explicit lda #0, or maybe use txa/tya if there was a free zero to be had.

brigade 10 hours ago | parent | prev | next [-]

Cortex A8 vsub reads the second source register a cycle earlier than veor, so that can add one cycle latency

Not scalar, but still sub vs xor. Though you’d use vmov immediate for zeroing anyway.

em3rgent0rdr 4 hours ago | parent | prev | next [-]

With more bits, SUB gets more and more expensive to fit into the same number of clocks as XOR. So with an 8-bit CPU like the Z80, it probably makes design sense to have XOR and SUB both take one cycle. But if a CPU uses, for instance, 128-bit registers, the propagate-and-carry logic for ADD/SUB might take so much longer than XOR that the designers wouldn't try to fit ADD/SUB into the same single clock cycle, and might instead do multi-cycle pipelined ADD/SUB.

A real-world CPU example is the Cray-1, where S-Register Scalar Operations (64-bit) take 3 cycles for ADD/SUB but still only 1 cycle for XOR. [1]

[1] https://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM...

6 hours ago | parent | prev | next [-]
[deleted]
GoblinSlayer 9 hours ago | parent | prev [-]

Harvard Mark I? Not sure why people think programming started with Z80.

bonzini 5 hours ago | parent | next [-]

The article is about x86, and x86 assembly is mostly a superset of 8080 (which is why machine language numbers registers as AX/CX/DX/BX, matching roughly the function of A/BC/DE/HL on the 8080—in particular with respect to BX and HL being last).

flohofwoe 5 hours ago | parent | prev [-]

My WW2-era assembly is a bit rusty, but I don't think the Harvard Mark 1 had bitwise logical operations?

adrian_b 8 hours ago | parent | prev | next [-]

XOR is faster when you do that alone in an FPGA or in an ASIC.

When you do XOR together with many other operations in an ALU (arithmetic-logical unit), the speed is determined by the slowest operation, so the speed of any faster operation does not matter.

This means that in almost all CPUs XOR and addition and subtraction have the same speed, despite the fact that XOR could be done faster.

In a modern pipelined CPU, the clock frequency is normally chosen so that a 64-bit addition can be done in 1 clock cycle, when including all the overheads caused by registers, multiplexers and other circuitry outside the ALU stages.

Operations more complex than 64-bit addition/subtraction have a latency greater than 1 clock cycle, even if one such operation can be initiated every clock cycle in one of the execution pipelines.

The operations less complex than 64-bit addition/subtraction, like XOR, are still executed in 1 clock cycle, so they do not have any speed advantage.

There have existed so-called superpipelined CPUs, where the clock frequency is increased, so that even addition/subtraction has a latency of 2 or more clock cycles.

Only in superpipelined CPUs would it be possible to have a XOR instruction that is faster than subtraction, but I do not know whether this has ever been implemented in a real superpipelined CPU, because it could complicate the execution pipeline for negligible performance improvements.

Initially superpipelining was promoted by DEC as a supposedly better alternative to the superscalar processors promoted by IBM. However, superpipelining was later abandoned, because the superscalar approach provides better energy efficiency for the same performance. (I.e. even if for a few years it was thought that a Speed Demon beats a Brainiac, eventually it was proven that a Brainiac beats a Speed Demon, as shown by the Apple CPUs.)

While mainstream CPUs do not use superpipelining, there have been some relatively recent IBM POWER CPUs that were superpipelined, but for a different reason than originally proposed. Those POWER CPUs were intended for having good performance only in multi-threaded workloads when using SMT, and not in single-thread applications. So by running simultaneous threads on the same ALU the multi-cycle latency of addition/subtraction was masked. This technique allowed IBM a simpler implementation of a CPU intended to run at 5 GHz or more, by degrading only the single-thread performance, without affecting the SMT performance. Because this would not have provided any advantage when using SMT, I assume that in those POWER CPUs XOR was not made faster than subtraction, even if this would have theoretically been possible.

bialpio 8 hours ago | parent | prev | next [-]

From TFA:

The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely.

mikequinlan 11 hours ago | parent | prev | next [-]

As TFA says, on x86 `sub eax, eax` encodes to the same number of bytes and executes in the same number of cycles.

whizzter 9 hours ago | parent [-]

On modern ones, yes. x86 has quite a history and the idiom might carry over from an even older machine.

Edit: Looked at the comments; it seems x86 and the major 8-bit CPUs had the same speed. Pondering whether this might be a remnant from the 4-bit ALU times.

abainbridge 9 hours ago | parent | next [-]

> seems like x86 and the major 8bit cpu's had the same speed, pondering in this might be a remnant from the 4-bit ALU times.

I think that era of CPUs used a single 1-bit circuit capable of doing add, sub, xor etc. They'd have 8 of them, with the signals propagating through them in a row. I think this page explains the situation on the 6502: https://c74project.com/card-b-alu-cu/

And this one for the ARM 1: https://daveshacks.blogspot.com/2015/12/inside-alu-of-armv1-...

But I'm a software engineer speculating about how hardware works. You might want to ask a hardware engineer instead.

adrian_b 7 hours ago | parent | prev [-]

Nope.

In any ALU the speed is determined by the slowest operation, so XOR is never faster. It does not matter what the width of the ALU is; all that matters is that an ALU does many kinds of operations, including XOR and subtraction, where the operation done by the ALU is selected by some control bits.

I have explained in another comment that the only CPUs where XOR can be faster than subtraction are the so-called superpipelined CPUs. Superpipelined CPUs have been made only after 1990 and there were very few such CPUs. Even if in superpipelined CPUs it is possible for XOR to be faster than subtraction, it is very unlikely that this feature has been implemented in any of the few superpipelined CPU models that have ever been made, because it would not have been worthwhile.

For general-purpose computers, there have never been "4-bit ALU times".

The first monolithic general-purpose processor was Intel 8008 (i.e. the monolithic version of Datapoint 2200), with an 8-bit ISA.

Intel claims that Intel 4004 was the first "microprocessor" (in order to move its priority earlier by one year), but that was not a processor for a general-purpose computer, but a calculator IC. Its only historical relevance for the history of personal computers is that the Intel team which designed 4004 gained a lot of experience with it and they established a logic design methodology with PMOS transistors, which they used for designing the Intel 8008 processor.

Intel 4004, its successors and similar 4-bit processors introduced later by Rockwell, TI and others, were suitable only for calculators or for industrial controllers, never for general-purpose computers.

The first computers with monolithic processors, a.k.a. microcomputers, used 8-bit processors, and then 16-bit processors, and so on.

For cost reduction, it is possible for an 8-bit ISA to use a 4-bit ALU or even just a serial 1-bit ALU, but this is transparent for the programmer and for general-purpose computers there never were 4-bit instruction sets.

deathanatos 3 hours ago | parent [-]

> In any ALU the speed is determined by the slowest operation, so XOR is never faster.

On a 386, a reg/reg ADD is 2 cycles. An r32 IMUL is "9-38" cycles.

If what you stated were true, you'd be locking XOR's speed to that of DIV. (Or you do not consider MUL/DIV "arithmetic", or something.)

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/op...

> I have explained in another comment that the only CPUs where XOR can be faster than subtraction are the so-called superpipelined CPUs. Superpipelined CPUs have been made only after 1990 and there were very few such CPUs.

(And I'm choosing 386 to avoid it being "a superpipelined CPU".)

hcs 3 hours ago | parent | next [-]

> Or you do not consider MUL/DIV "arithmetic", or something.

Multiplier and divider are usually not considered part of the ALU, yes. Not uncommon for those to be shared between execution threads while there's an ALU for each.

adrian_b an hour ago | parent | prev [-]

386 is a microprogrammed CPU where a multiplication is done by a long sequence of microinstructions, including a loop that is executed a variable number of times, hence its long and variable execution time.

A register-register operation required 2 microinstructions, presumably for an ALU operation and for writing back into the register file.

Unlike the later 80486, which had execution pipelines that allowed consecutive ALU operations to be executed back-to-back (a throughput of 1 ALU operation per clock cycle), the 80386 pipelined only the overall instruction execution: instruction fetching and decoding were overlapped with microinstruction execution, but there was no pipelining at a lower level, so it was not possible to execute ALU operations back to back. The fastest instructions required 2 clock cycles and most instructions required more.

In 80386, the ALU itself required the same 1 clock cycle for executing either XOR or SUB, but in order to complete 1 instruction the minimum time was 2 clock cycles.

Moreover, this time of 2 clock cycles was optimistic: it assumed that the processor had managed to fetch and decode the instruction before the previous instruction was completed. This was not always true, so a XOR or a SUB could randomly require more than 2 clock cycles when it needed to finish instruction decoding or fetching before doing the ALU operation.

In very old or very cheap processors there are no dedicated multipliers and dividers, so a multiplication or division is done by a sequence of ALU operations. In any high performance processor, multiplications are done by dedicated multipliers and there are also dedicated division/square root devices with their own sequencers. The dividers may share some circuits with the multipliers, or not. When the dividers share some circuits with the multipliers, divisions and multiplications cannot be done concurrently.

In many CPUs, the dedicated multipliers may share some surrounding circuits with an ALU, i.e. they may be connected to the same buses and they may be fed by the same scheduler port, so while a multiplication is executed the associated ALU cannot be used. Nevertheless the core multiplier and ALU remain distinct, because a multiplier and an ALU have very distinct structures. An ALU is built around an adder by adding a lot of control gates that allow the execution of related arithmetic operations, e.g. subtraction/comparison/increment/decrement and of bitwise operations. In cheaper CPUs the ALU can also do shifts and rotations, while in more performant CPUs there may be a dedicated shifter separated from the ALU.

The term ALU can be used in two different senses. The strict sense is that an ALU is a digital adder augmented with control gates that allow the selection of any operation from a small set, typically of 8 or 16 or 32 operations, which are simple arithmetic or bitwise operations. Before monolithic processors, computers were made using separate ALU circuits, like TI SN74181+SN74182, or circuits combining an ALU with registers, e.g. AMD 2901/2903.

In the wide sense, ALU may be used to designate an execution unit of a processor, which may include many subunits, which may be ALUs in the strict sense, shifters, multipliers, dividers, shufflers etc.

An ALU in the strict sense is the minimal kind of execution unit required by a processor. The modern high-performance processors have much more complex execution units.

phire 10 hours ago | parent | prev | next [-]

I'm not actually aware of any CPUs that perform a XOR faster than a SUB. And more importantly, they have identical timings on the 8086, which is where this pattern comes from.

5 hours ago | parent [-]
[deleted]
billpg 10 hours ago | parent | prev | next [-]

I had a similar reaction when learning 8086 assembly and finding the correct way to do `if x==y` was a CMP instruction which performed a subtraction and set only the flags. (The book had a section with all the branch instructions to use for a variety of comparison operators.) I think I spent a few minutes experimenting with XOR to see if I could fashion a compare-two-values-and-branch macro that avoided any subtraction.

rep_lodsb 3 hours ago | parent [-]

Comparing for equality can use either SUB or XOR: it sets the zero flag if (and only if) the two values are equal. That's why JE/JNE (jump if equal/not equal) is an alias for JZ/JNZ (jump if zero/not zero).

There's also the TEST instruction, which does a logical AND but without storing the result (like CMP does for SUB). This can be used to test specific bits.

Testing a single register for zero can be done in several ways, in addition to CMP with 0:

    TEST AX,AX
    AND  AX,AX
    OR   AX,AX
    INC  AX    followed by DEC AX (or the other way around)
The 8080/Z80 didn't have TEST, but the other three were all in common use. Particularly INC/DEC, since it worked with all registers instead of just the accumulator.

Also any arithmetic operation sets those flags, so you may not even need an explicit test. MOV doesn't set flags however, at least on x86 -- it does on some other architectures.
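
All of these idioms work because the result is zero exactly when the register is zero. A quick Python sanity check of that property (a sketch modelling 16-bit values, where "ZF" just means "result == 0"):

```python
# Why TEST/AND/OR (and INC-then-DEC) set ZF exactly when the register is zero.
MASK = 0xFFFF                      # model 16-bit registers

def zf(result):
    return (result & MASK) == 0    # ZF means "the result was zero"

for ax in [0, 1, 0x1234, 0x8000, MASK]:
    assert zf(ax & ax) == (ax == 0)          # TEST AX,AX / AND AX,AX
    assert zf(ax | ax) == (ax == 0)          # OR AX,AX
    inc_dec = ((ax + 1) - 1) & MASK          # INC AX then DEC AX (wraps at 16 bits)
    assert inc_dec == ax and zf(inc_dec) == (ax == 0)
```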

Tepix 10 hours ago | parent | prev | next [-]

From TFA:

> It encodes to the same number of bytes, executes in the same number of cycles.

abainbridge 9 hours ago | parent [-]

Those aren't the only resources. I could imagine XOR takes less energy because using it might activate less circuitry than SUB.

zahlman 5 hours ago | parent [-]

I'm not aware of any stories in the historical record of "real programmers" optimizing for power use, only for speed or code size.

toast0 an hour ago | parent | next [-]

Sibling posted a good example. But I know of (without details) things where you have to insert nops to keep peak power down, so the system doesn't brown out (in my experience, the 68hc11 won't take conditional branches if the power supply voltage dips too far; but I didn't work around that, I just made sure to use fresh batteries when my code started acting up). Especially during early boot.

Apple got in a lot of trouble for reducing peak power without telling people, to avoid overloading dying batteries.

abainbridge 3 hours ago | parent | prev [-]

For a few years I worked in the team that wrote software for an embedded audio DSP. The power draw to do something was normally more important than the speed. E.g. when decoding MP3 or SBC you probably had enough MIPS to keep up with the stream rate, so the main thing the customers cared about was battery life. Mostly the techniques to optimize for speed were the same as those for power. But I remember being told that add/sub used less power than multiply even though both were single cycle. And that loops with fewer than 16 instructions used less power, because there was a simple 16-instruction program memory cache that saved the energy required to fetch instructions from RAM or ROM. (The RAM and ROM access was generally single cycle too.)

Nowadays, I expect optimizations that minimize energy consumption are an important target for LLM hosts.

virexene 10 hours ago | parent | prev | next [-]

The operation is slightly more complex yes, but has there ever been an x86 CPU where SUB or XOR takes more than a single CPU cycle?

praptak 10 hours ago | parent [-]

I wonder if you could measure the difference in power consumption.

I mean, not for zeroing because we know from the TFA that it's special-cased anyway. But maybe if you test on different registers?

drob518 5 hours ago | parent | prev | next [-]

Yea, that’s what immediately went through my head, too. XOR is ALWAYS going to be single cycle because it’s bit-parallel.

defmacr0 10 hours ago | parent | prev | next [-]

I would be surprised if modern CPUs didn't decode "xor eax, eax" into a set of micro-ops that simply moves from an externally invisible dedicated 0 register. These days the x86 ISA is more of an API contract than an actual representation of what the hardware internals do.

defrost 10 hours ago | parent | next [-]

From TFA:

  The predominance of these idioms as a way to zero out a register led Intel to add special xor r, r-detection and sub r, r-detection in the instruction decoding front-end and rename the destination to an internal zero register, bypassing the execution of the instruction entirely. You can imagine that the instruction, in some sense, “takes zero cycles to execute”.
rasz 7 hours ago | parent [-]

"rename the destination to an internal zero register"

That would be quite late then: 1997, with the Pentium II, for the general population.

brigade 10 hours ago | parent | prev [-]

Zero micro-ops, to be precise: that's handled entirely at the register rename stage with no data movement.

feverzsj 10 hours ago | parent | prev | next [-]

It's like 0.5 cycles vs 0.9 cycles. So both are 1 cycle, considering synchronization.

pishpash 10 hours ago | parent [-]

But energy consumption could be different for this hypothetical 0.5 and 0.9.

scheme271 10 hours ago | parent [-]

Energy consumption wasn't really a concern when the idiom developed. I don't think people really cared about the energy consumption of instructions until well into the x86-64 era.

allenrb 5 hours ago | parent [-]

Not sure why this is being downvoted, but it’s absolutely correct. For most of the history of computing, people were happy that it worked at all. Being concerned about energy efficiency is a recent byproduct of mobile devices and, even more recently, giant amounts of compute adding up to gigawatts.

jojobas 10 hours ago | parent | prev | next [-]

The non-obvious bit is why there isn't an even faster and shorter "mov <register>,0" instruction - the processors only started short-circuiting xor <register>,<register> much later.

defmacr0 5 hours ago | parent | next [-]

In x86, a basic immediate instruction with a 1 Byte immediate value is encoded like this:

<op> (1 Byte opcode), <Registers> (1 Byte), <immediate value> (1 Byte)

While xor eax, eax only uses 2 bytes. Since there are only 8 registers, meaning they can be encoded with 3 bits, you can pack two values into the <Registers> field (ModR/M).

Making mov eax, 0 take only two bytes would require significant changes to the ISA to allow immediate values in the ModR/M byte (or similar), but there would be little benefit, since zeroing can already be done in 2 bytes and I doubt other cases are anywhere near frequent enough for it to matter overall. An actual improvement would be a dedicated 1-byte set-eax-to-0 instruction, but that comes at a tradeoff where some other operation has to be encoded differently (probably with more bytes), and you can't zero anything else with it.

https://wiki.osdev.org/X86-64_Instruction_Encoding

https://pyokagan.name/blog/2019-09-20-x86encoding/
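
For concreteness, here are the standard 32-bit encodings of the two options (byte values per the usual x86 encoding rules), with a quick size check:

```python
# Byte-for-byte comparison of the two common ways to zero EAX in 32-bit mode.
XOR_EAX_EAX = bytes([0x31, 0xC0])      # 31 /r = XOR r/m32, r32; ModR/M 0xC0 = eax, eax
MOV_EAX_0 = bytes([0xB8, 0, 0, 0, 0])  # B8+rd = MOV eax, imm32 (4-byte immediate)

assert len(XOR_EAX_EAX) == 2
assert len(MOV_EAX_0) == 5
```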

rep_lodsb 3 hours ago | parent [-]

Some other architectures like PDP-11 and 680x0 had a dedicated "clear register" instruction.

It could have been added to x86, even as a group of single-byte opcodes with the register encoded in three bits (as with PUSH, POP, and INC/DEC outside of long mode). But the XOR idiom was already established on the 8080 by that point.

HarHarVeryFunny 5 hours ago | parent | prev | next [-]

A number of the RISC processors have a special zero register, giving you a "mov reg, zero" instruction.

Of course many of the RISC processors also have fixed length instructions, with small literal values being encoded as part of the instruction, so "mov reg, #0" and "mov reg, zero" would both be same length.

drob518 5 hours ago | parent | prev | next [-]

Right, like a “set reg to zero” instruction. One byte. Just encodes the operation and the reg to zero. I’m surprised we didn’t have it on those old processors. Maybe the thinking was that it was already there: xor reg,reg.

bonzini 5 hours ago | parent | next [-]

One-byte instructions, with 8 registers as in the 8086, waste 8 opcodes, which is 3% of the total. There are just five such groups: "INC reg", "DEC reg", "PUSH reg", "POP reg", "XCHG AX, reg" (which is 7 wasted opcodes instead of 8, because "XCHG AX, AX" doubles as NOP).

One-byte INC/DEC was dropped with x86-64, and PUSH/POP are almost obsolete in APX due to its addition of PUSH2/POP2, leaving only the least useful of the five in the most recent incarnation of the instruction set.

drob518 5 hours ago | parent [-]

I’m not sure I understand what you mean by “waste 8 opcodes.”

GuB-42 4 hours ago | parent | next [-]

There are only 256 1-byte opcodes or prefixes available; if you take 8 of these to zero registers, they won't be available for other instructions, and unless you consider zeroing so important that it really needs its own 1-byte opcodes, it is redundant since you can use the 2-byte "xor reg,reg" instead, hence the "waste".

In addition, you would need 16 opcodes, not 8, if you also wanted to cover 8 bit registers (AH/AL,...).

Special shout-out to the undocumented SALC instruction, which sets AL to 0xFF or 0x00 depending on the carry flag. If you know that the carry will be 0, it is a nice sizecoding trick to zero AL in 1 byte.

bonzini 4 hours ago | parent | prev | next [-]

They occupy 8 of the possible 256 byte values. Together, those five cases used about 15% of the space.

Though I was forgetting one important case: MOV r,imm also used one-byte opcodes with the register index embedded. And it came in byte and word variants, so it used a further 16 opcode bytes, for a total of 56 one-byte opcodes with register encoding.

drob518 an hour ago | parent [-]

Gotcha, thanks for clarifying. I was reacting to the word “waste” I guess. Surely, as you say, it consumes that opcode encoding space. Whether that’s a waste or not depends on a lot of other things, I suppose. I wasn’t necessarily thinking x86-specific in my original comment. But yea, if you try to zero every possible register and half-word register you would definitely consume lots of encoding space.

LegionMammal978 4 hours ago | parent | prev | next [-]

Traditionally in x86, only the first byte is the opcode used to select the instruction, and any further bytes contain only operands. Thus, since there exist 256 possible values for the initial byte, there are at most 256 possible opcodes to represent different instructions.

So if you add a 1-byte instruction for each register to zero its value, that consumes 8 of the possible 256 opcodes, since there are 8 registers. Traditional x86 did have several groups of 1-byte instructions for common operations, but most of them were later replaced with multibyte encodings to free up space for other instructions.

gpderetta 4 hours ago | parent | prev [-]

A special "mov 0" instruction times 8 registers. The opcode space, especially the 1-byte opcode space, is precious, so encoding redundant operations is wasteful.

flohofwoe 5 hours ago | parent | prev [-]

Instruction slots are extremely valuable in 8-bit instruction sets. The Z80 has some free slots left in the ED-prefixed instruction subset, but being prefix-instructions means they could at best run at half speed of one-byte instructions (8 vs 4 clock cycles).

10 hours ago | parent | prev [-]
[deleted]
themafia 10 hours ago | parent | prev | next [-]

XOR and SUB have had identical cycle counts and latencies since the 8088. That's because you can "look ahead" when doing carries in binary. It's just a matter of how much floorspace on the chip you want to use.

https://en.wikipedia.org/wiki/Carry-lookahead_adder

The only minor difference between the two on x86, really, is SUB sets OF and CF according to the result while XOR always clears them.

asQuirreL 10 hours ago | parent | next [-]

A carry lookahead adder makes your circuit depth logarithmic in the width of the inputs vs linear for a ripple-carry adder, but that is still asymptotically worse than XOR's constant depth.

(But this does not discount the fact that basically all CPUs treat them both as one cycle)
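
The logarithmic depth comes from a parallel-prefix combine. A toy Kogge-Stone-style sketch in Python (illustrative only, not a gate-level design): all 8 carries are ready after ceil(log2 8) = 3 combine rounds instead of 8 ripple steps.

```python
# Parallel-prefix (Kogge-Stone style) carry computation: all n carries are
# ready after ceil(log2 n) combine rounds, versus n sequential ripple steps.

def prefix_carries(a, b, n=8):
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # bit generates a carry
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # bit propagates a carry
    rounds, dist = 0, 1
    while dist < n:
        # Combine each position with the one `dist` below it, all in parallel.
        g = [g[i] | (p[i] & g[i - dist]) if i >= dist else g[i] for i in range(n)]
        p = [p[i] & p[i - dist] if i >= dist else p[i] for i in range(n)]
        rounds += 1
        dist *= 2
    return g, rounds     # g[i] is now the carry out of bit i (carry-in = 0)

a, b = 0b10110101, 0b01101011
carry_outs, rounds = prefix_carries(a, b)
assert rounds == 3                         # log2(8) combine rounds, not 8
assert carry_outs[7] == (a + b) >> 8       # matches the real carry out
```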

bonzini 5 hours ago | parent | prev [-]

OF/CF/AF are always cleared anyway by SUB r,r. So there's absolutely no difference.

themafia an hour ago | parent [-]

The point is OF/CF are sometimes dependent on the inputs for SUB. They never are for XOR.

bonzini 12 minutes ago | parent [-]

Ah, you mean in terms of complexity of the calculation. Thanks for clarifying.

In practice AF and CF can be computed from the carry out vector which is already available, and OF is a single XOR (of the two most significant bits of the carry out vector). The same circuitry works for XOR and SUB if the carry out vector of XOR is simply all zeroes.
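
A quick Python sketch of that claim (assuming the usual implementation of a - b as a + ~b + 1; `sub_flags` is a made-up name for illustration): CF falls out of the top carry, and OF is the XOR of the two most significant carries.

```python
# Flags of an 8-bit SUB (a - b) derived from the carry-out vector, modelling
# subtraction as a + ~b + 1. Toy sketch, not a hardware description.

def sub_flags(a, b, n=8):
    x, y = a, (~b) & ((1 << n) - 1)          # a plus one's complement of b...
    carries = [1]                            # ...plus a carry-in of 1
    for i in range(n):
        ai, bi = (x >> i) & 1, (y >> i) & 1
        carries.append((ai & bi) | (ai & carries[i]) | (bi & carries[i]))
    cf = carries[n] ^ 1                      # x86 CF = borrow = NOT carry-out
    of = carries[n] ^ carries[n - 1]         # OF = XOR of the two top carries
    return cf, of

assert sub_flags(5, 3) == (0, 0)      # no borrow, no signed overflow
assert sub_flags(3, 5) == (1, 0)      # borrow (CF set)
assert sub_flags(0x80, 1) == (0, 1)   # -128 - 1 overflows signed 8-bit
assert sub_flags(7, 7) == (0, 0)      # SUB r,r: CF and OF always end up clear
```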

bahmboo 10 hours ago | parent | prev | next [-]

Because he is explicitly talking about x86 - maybe you missed that.

TacticalCoder 7 hours ago | parent | prev [-]

> The obvious answer is that XOR is faster.

It used to be not only faster but also smaller. And back then this mattered.

Say you had a computer running at 33 MHz: you had 33 million cycles per second to do your stuff. A 60 Hz game? 33 million / 60 and suddenly you only have about 550 000 cycles per frame. 200 scanlines? Suddenly you're left with about 2750 cycles per scanline to do your stuff. And 2750 cycles really isn't that much.
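
Spelling out the budget arithmetic as a quick sanity check:

```python
# Cycle budget for a 60 Hz game with 200 visible scanlines on a 33 MHz CPU.
CLOCK_HZ = 33_000_000
FPS = 60
SCANLINES = 200

cycles_per_frame = CLOCK_HZ // FPS                    # per-frame budget
cycles_per_scanline = cycles_per_frame // SCANLINES   # per-scanline budget

assert cycles_per_frame == 550_000
assert cycles_per_scanline == 2_750
```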

So every cycle counted back then. We'd use the official doc and see how many cycles each instruction would take. And we'd then verify by code that this was correct too. And memory mattered too.

XOR was both faster and smaller (fewer bytes) than a MOV ..., 0.

Full stop.

And when those CPUs first began having caches, the caches were really tiny at first: literally caching a ridiculously low number of CPU instructions. We could actually count the size of the cache manually (for example by filling it with a few NOP instructions, then modifying them to, say, add one, and checking which result we got at the end).

XOR, due to being smaller, allowed to put more instructions in the cache too.

Now people may lament that it persisted long after our x86 CPUs weren't even real x86 CPUs anymore, but that is another topic.

But there's a reason XOR was used and people should deal with it.

We zero with XOR EAX,EAX and that's it.

zahlman 5 hours ago | parent [-]

The context was comparison to SUB EAX,EAX, not to a MOV.