"Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination."

Will remember for the next time I write asm for Itanium!
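
To make the quoted point concrete, here is a toy Python model of NaT propagation. It is only a sketch: `Reg`, `xor`, `mov` and `R0` are made-up names for illustration, and the single rule it encodes (the result is NaT if any source is NaT) is a simplification of the real Itanium behaviour.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class Reg:
    """Toy general register: a 64-bit value plus a NaT ("Not a Thing") bit."""
    value: int = 0
    nat: bool = False


# Stand-in for the hardwired zero register: always 0, never NaT.
R0 = Reg(0, False)


def xor(a: Reg, b: Reg) -> Reg:
    # Assumed rule for this model: NaT propagates from either source operand.
    return Reg((a.value ^ b.value) & MASK64, a.nat or b.nat)


def mov(src: Reg) -> Reg:
    # A plain copy carries both the value and the NaT bit of its source.
    return Reg(src.value, src.nat)


# A register poisoned by, say, a failed speculative load.
poisoned = Reg(0x1234, nat=True)

via_xor = xor(poisoned, poisoned)  # value becomes 0, but NaT is still set
via_mov = mov(R0)                  # clean zero: value 0, NaT clear

print(via_xor)
print(via_mov)
```

In this model the xor trick produces a zero value that is still flagged NaT, while copying from the zero register gives a clean zero.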

▲ dlcarrier an hour ago | parent | next [-]

It would probably run really fast, considering that Itanium's downfall was the difficulty in compiling. (Including translating x86 instructions into Itanium instructions)

	▲ tliltocatl 8 minutes ago | parent [-]
		Not really. Itanium was the result of some people at Intel being obsessed with LINPACK benchmarks and forgetting everything else. It sucked for random memory access, and hence for everything that's not floating-point number-crunching. A compiler can't hide memory access latency because it's fundamentally unpredictable. VLIW does magic for floating-point latency (which is predictable), but:
		- As transistors got smaller, FP performance increased, while memory latency stayed the same (or even increased).
		- If you are doing a lot of floating point, you are probably doing array processing, so you might as well go for a GPU or at least SIMD.
		- Low instruction density is bad for the I-cache. Yes, RISC fans, density matters! And VLIW is an absolute disaster in that regard. Again, this is less visible in number-crunching loads, where the processor executes relatively small loops many times over.

▲ shawn_w 10 hours ago | parent | prev [-]

Quite a few architectures have a dedicated 0 register.

▲ monocasa 14 minutes ago | parent | next [-]

Very few architectures have a NaT bit though.

▲ repelsteeltje 10 hours ago | parent | prev | next [-]

Yep. The XOR trick - relying on a special use of an opcode rather than a special register - is probably related to the limited number of (general-purpose) registers in typical 1970s-era CPU designs (8080, 6502, Z80, 8086).

▲ classichasclass 5 hours ago | parent | next [-]

Unfortunately, 6502 can't XOR the accumulator with itself. I don't recall if the Z80 can, and loading an immediate 0 would be most efficient on those anyway.

▲ blywi 5 hours ago | parent | next [-]

XOR A absolutely works on the Z80, and it is of course faster and shorter than loading a zero with LD A,0: LD A,0 encodes to 2 bytes, while XOR A is a single-byte opcode. XOR A has the additional benefit of also clearing the carry flag (along with S, H and N; Z and P/V end up set, because the result is zero and has even parity). SUB A will clear the accumulator too, but it always sets the N flag on the Z80.
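
For anyone who wants to check the flag behaviour, here is a rough Python model of the two instructions. It is a sketch only: it covers the documented flag bits (the undocumented bits 3 and 5 of F are ignored), the function names are invented, and the byte/T-state figures in `ENCODINGS` are the commonly documented ones.

```python
# Documented Z80 flag bit positions in the F register.
S, Z, H, PV, N, C = 0x80, 0x40, 0x10, 0x04, 0x02, 0x01

# Opcode bytes and T-states for the three ways of zeroing A.
ENCODINGS = {
    "XOR A":  (bytes([0xAF]), 4),
    "SUB A":  (bytes([0x97]), 4),
    "LD A,0": (bytes([0x3E, 0x00]), 7),
}


def xor_a(a):
    """Model XOR A: return (new A, documented flag bits)."""
    result = (a ^ a) & 0xFF
    flags = 0                      # H, N and C are reset by XOR
    if result & 0x80:
        flags |= S
    if result == 0:
        flags |= Z
    if bin(result).count("1") % 2 == 0:
        flags |= PV                # P/V reports parity for logic operations
    return result, flags


def sub_a(a):
    """Model SUB A: return (new A, documented flag bits)."""
    result = (a - a) & 0xFF
    flags = N                      # N is set by every subtraction
    if result & 0x80:
        flags |= S
    if result == 0:
        flags |= Z
    # A - A never borrows, half-borrows or overflows, so C, H and P/V stay clear.
    return result, flags


for name, (code, tstates) in ENCODINGS.items():
    print(f"{name:7s} {code.hex(' '):6s} {len(code)} byte(s), {tstates} T-states")
print("XOR A flags:", format(xor_a(0x5A)[1], "08b"))
print("SUB A flags:", format(sub_a(0x5A)[1], "08b"))
```

LD A,0 is left out of the flag model because loads of this kind do not touch F at all, which is sometimes exactly why you would pick it.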

	▲ eichin 13 minutes ago | parent | next [-]
		Yeah, the article seems to have missed the likely biggest reason that this is the popular x86 idiom - that it was already the popular 8080/Z80 idiom from the CP/M era, and there's a direct line (and a bunch of early 8086 DOS applications were mechanically translated assembly code, so while they are "different" architectures they're still solidly related.)
	▲ classichasclass 4 hours ago | parent | prev [-]
		Ah, thanks, I couldn't recall off the top of my head.

▲ repelsteeltje 4 hours ago | parent | prev | next [-]

You're absolutely right, I stand corrected.

The 6502 gets by with an immediate load: 2 clock cycles, 2 bytes (frequently followed by a single-byte register transfer instruction). Out of curiosity I did a quick scan of the MOS 1.20 ROM of the BBC Micro:

  LDY #0 (a0 00): 38 hits
  LDX #0 (a2 00): 28 hits
  LDA #0 (a9 00): 48 hits
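
A scan like this can be approximated in a few lines of Python. This is a guess at the method, not the script actually used: it simply counts raw byte pairs, so it will also count matches that happen to fall inside data tables or instruction operands. The file name `mos120.rom` in the usage note is hypothetical.

```python
# Two-byte encodings of the 6502 "load immediate zero" instructions.
ZERO_LOADS = {
    "LDY #0": bytes([0xA0, 0x00]),
    "LDX #0": bytes([0xA2, 0x00]),
    "LDA #0": bytes([0xA9, 0x00]),
}


def count_zero_loads(rom):
    """Count raw occurrences of each pattern in a ROM image.

    No disassembly is attempted, so the result is an upper bound on the
    number of real instructions.
    """
    return {name: rom.count(pattern) for name, pattern in ZERO_LOADS.items()}


# Tiny synthetic image: LDA #0 / TAX / LDY #0 / LDA #0 / LDX #1
sample = bytes([0xA9, 0x00, 0xAA, 0xA0, 0x00, 0xA9, 0x00, 0xA2, 0x01])
for name, hits in count_zero_loads(sample).items():
    print(f"{name}: {hits} hits")
```

To run it on a real dump, read the file in binary mode, for example `count_zero_loads(open("mos120.rom", "rb").read())`.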

▲ bonzini 5 hours ago | parent | prev [-]

The Z80 can do either LD A,0 or SUB A or XOR A, but the LD is slower due to the extra memory cycle to load the second byte of the instruction.

▲ wongarsu 5 hours ago | parent | prev | next [-]

And [as mentioned in the article] even modern x86 implementations have a zero register. So you have this weird special-case opcode that (when used with identical source and destination) only triggers register renaming.

▲ bonzini 5 hours ago | parent | prev [-]

A move on SPARC is technically an OR of the source with the zero register: "mov %l0, %l1" is assembled as "or %g0, %l0, %l1". So if you want to zero a register, you OR %g0 with itself into the destination.
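
Here is a small Python sketch of how that looks at the encoding level, assuming the usual SPARC V8 format-3 layout for register-register arithmetic (op in bits 31:30, rd in 29:25, op3 in 24:19, rs1 in 18:14, rs2 in 4:0). The helper names `reg`, `encode_or`, `mov` and `clr` are invented for the example.

```python
# Register numbering: %g0-%g7 = 0-7, %o0-%o7 = 8-15, %l0-%l7 = 16-23, %i0-%i7 = 24-31.
BANK_BASE = {"g": 0, "o": 8, "l": 16, "i": 24}

OP_ARITH = 2    # format-3 arithmetic/logical group
OP3_OR = 0x02   # op3 value for OR


def reg(name):
    """Translate a name such as '%l1' into its 5-bit register number."""
    return BANK_BASE[name[1]] + int(name[2])


def encode_or(rs1, rs2, rd):
    """Encode 'or rs1, rs2, rd' in its register-register form (i bit = 0)."""
    return (
        (OP_ARITH << 30)
        | (reg(rd) << 25)
        | (OP3_OR << 19)
        | (reg(rs1) << 14)
        | reg(rs2)
    )


def mov(src, dst):
    # Register move expressed as: or %g0, src, dst
    return encode_or("%g0", src, dst)


def clr(dst):
    # Zeroing a register expressed as: or %g0, %g0, dst
    return encode_or("%g0", "%g0", dst)


print(f"mov %l0, %l1 -> {mov('%l0', '%l1'):#010x}")
print(f"clr %l1      -> {clr('%l1'):#010x}")
```

The two words differ only in the rs2 field, which is the whole point: there is no separate move or clear opcode, just OR with %g0 as an operand.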

▲ lynguist 10 hours ago | parent | prev | next [-]

Indeed!!

MIPS - $zero

RISC-V - x0

SPARC - %g0

ARM64 - XZR

	▲ classichasclass 5 hours ago | parent | next [-]
		PowerPC: "r0 occasionally" (with certain instructions like addi, though this might be better considered an edge case of encoding)
	▲ Findecanor 2 hours ago | parent | prev | next [-]
		On 64-bit ARM, the same register number is XZR in some instructions and the stack pointer in others.
	▲ matja 3 hours ago | parent | prev [-]
		Alpha: r31, f31

▲ signa11 10 hours ago | parent | prev [-]

indeed. riscv for instance. also, afaik, xor’ing is faster. i would assume that someone like mr. raymond would know…

▲ IshKebab 10 hours ago | parent | next [-]

> afaik, xor’ing is faster

Even tiny tiny CPUs can do sub in one cycle, so I doubt that. On super-scalar CPUs xor and sub are normally issued to the same execution units so it wouldn't make a difference there either.

	▲ tliltocatl 10 hours ago | parent [-]
		On superscalars, running the xor trick as-is would be significantly slower, because it implies a data dependency where there isn't one. But all out-of-order x86 CPUs optimize it away internally.
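
The false-dependency point can be illustrated with a toy scheduling model in Python. Everything here is an assumption made for illustration: the instruction tuples, the latencies, and the `finish_times` function are invented, the machine has unlimited issue width, and only read-after-write dependencies delay an instruction.

```python
def finish_times(program, latency, zero_idiom_aware):
    """Return the cycle at which each instruction's result becomes ready.

    program is a list of (op, dst, src) tuples meaning dst = dst op src.
    """
    ready = {}   # register -> cycle at which its newest value is available
    times = []
    for op, dst, src in program:
        if zero_idiom_aware and op == "xor" and dst == src:
            inputs = []            # recognized zeroing idiom: no real inputs
        else:
            inputs = [dst, src]    # naive view: wait for both operands
        start = max((ready.get(r, 0) for r in inputs), default=0)
        done = start + latency[op]
        ready[dst] = done          # overwriting models register renaming
        times.append(done)
    return times


# A slow divide into eax, then eax is zeroed and reused.
program = [
    ("div", "eax", "ebx"),
    ("xor", "eax", "eax"),
    ("add", "eax", "ecx"),
]
latency = {"div": 20, "xor": 1, "add": 1}   # illustrative numbers only

print("naive:      ", finish_times(program, latency, zero_idiom_aware=False))
print("idiom-aware:", finish_times(program, latency, zero_idiom_aware=True))
```

In the naive run the xor and the add queue up behind the divide. With the idiom recognized they complete almost immediately, which is the effect the comment describes.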

▲ pif 10 hours ago | parent | prev [-]

Which part of "mathematical operations don’t reset the NaT bit" did you not understand?