| ▲ | Sweepi 10 hours ago |
| "Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination." Will remember for the next time I write asm for Itanium! |
|
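A minimal sketch of the NaT point quoted above, as a toy Python model and not real Itanium semantics in full. The `Reg` class, `R0`, and the `xor`/`mov` helpers are names made up for illustration; the only behavior modeled is that an arithmetic result inherits NaT from its inputs, while the zero register is never NaT.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class Reg:
    """Toy general register: a 64-bit value plus a NaT ("Not a Thing") flag."""
    value: int
    nat: bool = False


# Hypothetical stand-in for the hardwired zero register: always 0, never NaT.
R0 = Reg(0, False)


def xor(a: Reg, b: Reg) -> Reg:
    # The result is NaT if either input is NaT, even when both inputs
    # are the same register and the numeric result is known to be 0.
    return Reg((a.value ^ b.value) & MASK64, a.nat or b.nat)


def mov(src: Reg) -> Reg:
    # A register-to-register move copies both the value and the NaT state.
    return Reg(src.value, src.nat)


poisoned = Reg(0x1234, nat=True)
via_xor = xor(poisoned, poisoned)   # value is 0, but NaT is still set
via_zero = mov(R0)                  # value is 0 and NaT is clear
```

In this model the xor idiom yields a numeric zero that is still poisoned, which is why moving from the zero register is the safer way to zero a destination.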
| ▲ | dlcarrier an hour ago | parent | next [-] |
| It would probably run really fast, considering that Itanium's downfall was the difficulty in compiling. (Including translating x86 instructions into Itanium instructions) |
| |
| ▲ | tliltocatl 8 minutes ago | parent [-] | | Not really. Itanium was the result of some people at Intel being obsessed with LINPACK benchmarks and forgetting everything else. It sucked for random memory access, and hence for everything that's not floating-point number-crunching. The compiler can't hide memory access latency because it's fundamentally unpredictable. VLIW does magic for floating-point latency (which is predictable), but:
- As transistors got smaller, FP performance increased while memory latency stayed the same (or even increased).
- If you are doing a lot of floating point, you are probably doing array processing, so you might as well go for a GPU (or at least SIMD).
- Low instruction density is bad for I-cache. Yes, RISC fans, density matters! And VLIW is an absolute disaster in that regard. Again, this is less visible in number-crunching loads where the processor executes relatively small loops many times over. |
|
|
| ▲ | shawn_w 10 hours ago | parent | prev [-] |
| Quite a few architectures have a dedicated 0 register. |
| |
| ▲ | monocasa 14 minutes ago | parent | next [-] | | Very few architectures have a NaT bit though. | |
| ▲ | repelsteeltje 10 hours ago | parent | prev | next [-] | | Yep. The XOR trick - relying on a special use of an opcode rather than a special register - is probably related to the limited number of (general purpose) registers in typical 1970s-era CPU designs (8080, 6502, Z80, 8086). | | |
| ▲ | classichasclass 5 hours ago | parent | next [-] | | Unfortunately, 6502 can't XOR the accumulator with itself. I don't recall if the Z80 can, and loading an immediate 0 would be most efficient on those anyway. | | |
| ▲ | blywi 5 hours ago | parent | next [-] | | XOR A absolutely works on Z80 and it's of course faster and shorter than loading a zero value with LD A,0.
LD A,0 is encoded to 2 bytes while XOR A is encoded as a single opcode.
XOR A has the additional benefit of also clearing the carry flag (along with N and H; Z and P/V end up set because the result is zero). SUB A will also clear the accumulator, but it always sets the N flag on Z80. | | |
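The comparison above can be tabulated as a small Python sketch. The opcode bytes and T-state counts are the standard documented Z80 values; the table layout and the `summary` helper are made up for illustration, and only the documented flag bits are modeled (the undocumented bits 3 and 5 are ignored).

```python
# Bit masks of the documented flags in the Z80 F register.
FLAG_S, FLAG_Z, FLAG_H, FLAG_PV, FLAG_N, FLAG_C = 0x80, 0x40, 0x10, 0x04, 0x02, 0x01

# mnemonic -> (encoding, T-states, F after execution, or None if F is untouched)
ZEROING = {
    "LD A,0": (bytes([0x3E, 0x00]), 7, None),
    "XOR A": (bytes([0xAF]), 4, FLAG_Z | FLAG_PV),  # zero result, even parity
    "SUB A": (bytes([0x97]), 4, FLAG_Z | FLAG_N),   # zero result, subtract flag
}


def summary(mnemonic):
    """Return (size in bytes, T-states, carry cleared?, N set?) for one idiom."""
    encoding, t_states, flags = ZEROING[mnemonic]
    carry_cleared = flags is not None and not (flags & FLAG_C)
    n_set = flags is not None and bool(flags & FLAG_N)
    return len(encoding), t_states, carry_cleared, n_set


for name in ZEROING:
    print(name, summary(name))
```

Note that LD A,0 leaves the flags as they were, which is sometimes exactly what you want and is the one case where the longer encoding earns its keep.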
| ▲ | eichin 13 minutes ago | parent | next [-] | | Yeah, the article seems to have missed the likely biggest reason that this is the popular x86 idiom - that it was already the popular 8080/Z80 idiom from the CP/M era, and there's a direct line (and a bunch of early 8086 DOS applications were mechanically translated assembly code, so while they are "different" architectures they're still solidly related.) | |
| ▲ | classichasclass 4 hours ago | parent | prev [-] | | Ah, thanks, I couldn't recall off the top of my head. |
| |
| ▲ | repelsteeltje 4 hours ago | parent | prev | next [-] | | You're absolutely right, I stand corrected. The 6502 gets by with an immediate load: 2 clock cycles, 2 bytes (frequently followed by a single-byte register transfer instruction). Out of curiosity I did a quick scan of the MOS 1.20 ROM of the BBC Micro:
LDY #0 (a0 00): 38 hits
LDX #0 (a2 00): 28 hits
LDA #0 (a9 00): 48 hits
| |
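A quick scan like the one above could be sketched as follows. This is an assumption about the method, not the commenter's actual script: it counts raw byte pairs, so it can also match operand or data bytes that merely look like an opcode followed by 00. The `count_pattern` helper is hypothetical and the sample bytes are made up, not taken from any real ROM.

```python
def count_pattern(rom: bytes, pattern: bytes) -> int:
    """Count occurrences of pattern in rom, allowing overlaps."""
    count = 0
    start = rom.find(pattern)
    while start != -1:
        count += 1
        start = rom.find(pattern, start + 1)
    return count


# 6502 immediate-load opcodes followed by a zero operand.
PATTERNS = {
    "LDY #0": bytes([0xA0, 0x00]),
    "LDX #0": bytes([0xA2, 0x00]),
    "LDA #0": bytes([0xA9, 0x00]),
}

# Made-up sample: LDA #0; TAX; LDY #0; LDA #0; RTS
sample = bytes([0xA9, 0x00, 0xAA, 0xA0, 0x00, 0xA9, 0x00, 0x60])
hits = {name: count_pattern(sample, pat) for name, pat in PATTERNS.items()}
print(hits)
```

To run it against a real image, read the file with `open(path, "rb").read()` and pass the result in place of `sample`. A disassembler-based count would avoid the false positives, at the cost of needing to know where code starts.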
| ▲ | bonzini 5 hours ago | parent | prev [-] | | The Z80 can do either LD A,0 or SUB A or XOR A, but the LD is slower due to the extra memory cycle to load the second byte of the instruction. |
| |
| ▲ | wongarsu 5 hours ago | parent | prev | next [-] | | And [as mentioned in the article] even modern x86 implementations have a zero register. So you have this weird special opcode that (when called with identical source and destination) only triggers register renaming | |
| ▲ | bonzini 5 hours ago | parent | prev [-] | | A move on SPARC is technically an OR of the source with the zero register. "mov %l0, %l1" is assembled as "or %g0, %l0, %l1". So if you want to zero a register you OR %g0 with itself. |
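The idea of building moves and clears out of OR with a hardwired zero register can be sketched in Python. This is a toy model under stated assumptions: the `RegisterFile` class, the flat register indices, and the 32-bit width are all made up for illustration and ignore SPARC's register windows.

```python
class RegisterFile:
    """Toy register file in which register 0 always reads as zero."""

    def __init__(self, count=32):
        self._regs = [0] * count

    def read(self, index):
        return 0 if index == 0 else self._regs[index]

    def write(self, index, value):
        # Writes to register 0 are silently discarded.
        if index != 0:
            self._regs[index] = value & 0xFFFFFFFF


def op_or(rf, rs1, rs2, rd):
    """The only real instruction in this sketch: rd = rs1 | rs2."""
    rf.write(rd, rf.read(rs1) | rf.read(rs2))


def mov(rf, rs, rd):
    # Synthetic move: OR the zero register with the source.
    op_or(rf, 0, rs, rd)


def clr(rf, rd):
    # Synthetic clear: OR the zero register with itself.
    op_or(rf, 0, 0, rd)


rf = RegisterFile()
rf.write(16, 0xDEADBEEF)
mov(rf, 16, 17)   # register 17 now holds the value of register 16
clr(rf, 16)       # register 16 is zeroed without any dedicated opcode
rf.write(0, 5)    # discarded: register 0 stays zero
```

The design point is that one hardwired register lets the assembler offer move and clear as aliases, so the instruction set does not need separate encodings for them.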
| |
| ▲ | lynguist 10 hours ago | parent | prev | next [-] | | Indeed!! MIPS: $zero; RISC-V: x0; SPARC: %g0; ARM64: XZR | | |
| ▲ | classichasclass 5 hours ago | parent | next [-] | | PowerPC: "r0 occasionally" (with certain instructions like addi, though this might be better considered an edge case of encoding) | |
| ▲ | Findecanor 2 hours ago | parent | prev | next [-] | | On 64-bit ARM, the same register number is XZR in some instructions and the stack pointer in others. | |
| ▲ | matja 3 hours ago | parent | prev [-] | | Alpha: r31, f31 |
| |
| ▲ | signa11 10 hours ago | parent | prev [-] | | indeed. riscv for instance. also, afaik, xor’ing is faster. i would assume that someone like mr. raymond would know… | | |
| ▲ | IshKebab 10 hours ago | parent | next [-] | | > afaik, xor’ing is faster
Even tiny tiny CPUs can do sub in one cycle, so I doubt that. On superscalar CPUs, xor and sub are normally issued to the same execution units, so it wouldn't make a difference there either. | | |
| ▲ | tliltocatl 10 hours ago | parent [-] | | On superscalars, running the xor trick as-is would be significantly slower because it implies a data dependency where there isn't one. But all OOO x86s optimize it away internally. |
| |
| ▲ | pif 10 hours ago | parent | prev [-] | | Which part of "mathematical operations don’t reset the NaT bit" did you not understand? |
|
|