| ▲ | burnt-resistor 2 days ago |
| In ye olden days, bit manip operations were faster than algebraic operations. And sometimes even faster than a load immediate, hence XOR AX, AX instead of MOV AX, 0. |
|
| ▲ | GuB-42 2 days ago | parent | next [-] |
| "xor ax, ax" is still in use today. The main advantage is that it is shorter: just 2 bytes instead of 3 for the immediate form. The difference is bigger in 32- and 64-bit mode, where the immediate has to carry all those zeroes in the instruction. Shorter usually means faster, even if the instruction itself isn't faster. |
| |
| ▲ | sparkie 2 days ago | parent | next [-] | | In long mode, compilers will typically emit `xor eax, eax`, as it only needs 2 bytes: the opcode and the ModRM byte. `xor ax, ax` takes 3 bytes due to the operand-size override prefix (0x66), and `xor rax, rax` takes 3 bytes due to the REX.W prefix. `xor eax, eax` still clears the full 64-bit register, because a write to a 32-bit register zero-extends into the upper half. Shorter basically means you can fit more in the instruction cache, which should in theory improve performance marginally. | | |
| ▲ | Someone 2 days ago | parent [-] | | Size isn’t everything. You should start by reading the manual for your CPU to see what it advises. The micro-architecture may treat only one of the sequences specially. For modern x64, I think that is indeed the shorter xor sequence: internally, the CPU just renames the register to one that always contains zero, making the instruction independent of any earlier instructions using eax. IIRC, Intel said a mov was the way to go for some now-ancient x86 CPUs, though. |
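To make the byte counts in this subthread concrete, here is a small sketch that tabulates one standard long-mode encoding for each zeroing idiom. The table and the helper name `encoded_length` are purely illustrative; assemblers may pick an equivalent alternative encoding (for example opcode 0x33 instead of 0x31 for the xor), and nothing here says anything about execution speed on a given micro-architecture.

```python
# One standard x86-64 (long mode) encoding per zeroing idiom.
# Hypothetical table for illustration; other equally valid encodings exist.
ZEROING_ENCODINGS = {
    "xor eax, eax": bytes([0x31, 0xC0]),                    # opcode + ModRM
    "xor ax, ax":   bytes([0x66, 0x31, 0xC0]),              # 0x66 operand-size prefix
    "xor rax, rax": bytes([0x48, 0x31, 0xC0]),              # REX.W prefix
    "mov eax, 0":   bytes([0xB8, 0x00, 0x00, 0x00, 0x00]),  # opcode + imm32
    "mov rax, 0":   bytes([0x48, 0xC7, 0xC0,                # REX.W + opcode + ModRM
                           0x00, 0x00, 0x00, 0x00]),        # sign-extended imm32
}


def encoded_length(mnemonic: str) -> int:
    """Return the number of bytes in the tabulated encoding."""
    return len(ZEROING_ENCODINGS[mnemonic])


if __name__ == "__main__":
    for mnemonic, encoding in ZEROING_ENCODINGS.items():
        print(f"{mnemonic:<14} {encoding.hex(' '):<22} {len(encoding)} bytes")
```

The table only compares code size; whether the shorter form is also recognized as a zeroing idiom is a separate, CPU-specific question.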
| |
| ▲ | tyfighter 2 days ago | parent | prev | next [-] | | Modern x86 implementations don't even do the XOR. They just rename the register to "zero". | |
| ▲ | burnt-resistor 2 days ago | parent | prev [-] | | Barely. x86 is fading. Arm doesn't do this in GCC or Clang. "Shorter usually means faster": it depends, so spouting generalities doesn't mean anything. Balancing instruction cache line filling, cycle reduction, and reservation station ordering is typically a constrained optimization problem for the compiler. | | |
| ▲ | userbinator 2 days ago | parent [-] | | "Arm doesn't do this in GCC or Clang." That's because Arm64 has a zero register, Arm32 has small immediates, and all instructions are the same length. |
|
|
|
| ▲ | heisenbit a day ago | parent | prev [-] |
| And in these modern days it matters that an algorithm can use divide and conquer and can be parallelized. Xor plays nice here. Also the lack of carry bits and less branching help in the crypto space. |
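One way to read the divide-and-conquer point: XOR is associative and commutative and produces no carries, so a large XOR reduction can be split into chunks, each chunk folded independently, and the partial results combined in any order. A minimal sketch under that reading follows; the function names and the sample data are made up for illustration, and the chunks are folded sequentially here although each could go to its own worker.

```python
from functools import reduce
from operator import xor


def xor_fold(values):
    """XOR all values together; 0 is the identity element."""
    return reduce(xor, values, 0)


def xor_fold_chunked(values, chunk_size=4):
    """Fold fixed-size chunks independently, then fold the partial results.

    Each per-chunk fold has no dependency on any other chunk, so the
    list comprehension below could be replaced by a parallel map.
    """
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [xor_fold(chunk) for chunk in chunks]
    return xor_fold(partials)


if __name__ == "__main__":
    data = [0x1F, 0xA0, 0x33, 0x7C, 0x05, 0xFF, 0x42, 0x90, 0x11]
    # Both strategies agree regardless of how the input is split.
    assert xor_fold(data) == xor_fold_chunked(data, chunk_size=4)
    assert xor_fold(data) == xor_fold_chunked(data, chunk_size=2)
    print(hex(xor_fold(data)))
```

The same split would not be this simple for addition modulo a word size with carry flags in play, which is part of why XOR is convenient in this setting.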