| ▲ | RiverCrochet 4 hours ago |
| XOR is a simple logic-gate operation. SUB would have to be an ALU operation. A one-bit adder (which is subtraction in reverse) makes signals pass through two gates. See https://en.wikipedia.org/wiki/Adder_(electronics) You need the 2 gates for adding/subtracting because you care about carry. So if you're adding/subtracting 8 bits, 16 bits, or more, you're connecting multiples of these together, and that carry has to ripple through all the rest of the gates one-by-one. It can't be paralellized without extra circuitry, which increases your costs in other ways. Without the AND gate needed for carry, all the XORs can fire off at the same time. If you added the extra circuitry for a parallelizable add/subtract to make it as fast as XOR, your actual parallel XOR would consume less power. |
|
| ▲ | Symmetry 3 hours ago | parent | next [-] |
| That's all true, but on any modern x86 processor both the single pair of gates for the xor and the 10 or so for a carry-bypass 64 bit wide subtraction both happen with a single clock cycle of latency so from a programmer's perspective they're the same in that sense. There's still an energy difference but its tiny compared to what even the register file and bypass network for the operation use, let along the OoO structures. |
| |
| ▲ | masklinn 2 hours ago | parent | next [-] | | The question is why one idiom won over the other, which happened a long time ago. Because as the article notes on "any modern x86 processor" both xor r, r and sub r, r are handled by the frontend and have essentially no cost. | |
| ▲ | dataflow 3 hours ago | parent | prev | next [-] | | The question isn't whether they both take a clock cycle, but rather whether any future implementation of the ISA might ostensibly find some sort of performance advantage, even if none do right now. From that standpoint, xor seems like a safer bet. | | |
| ▲ | Symmetry an hour ago | parent | next [-] | | There's been a lot of churn over the years but additions being done in the same timeframe as XORs has been pretty constant. The Pentium 4 double pumped its ALU but both XORs and ADDs could happen in a half cycle latency. The POWER 6 cut the FO4s of latency in stage from 16 to 10 and kept that parity as well. When you need 2 FO4s for latching between stages and 2 to handle clock jitter at high frequencies the difference between what a XOR needs and what an ADD need start looking smaller, particularly when you include the circuitry to move the data and select the instruction. Maybe if we move to asynchronous circuits? | |
| ▲ | vablings 2 hours ago | parent | prev [-] | | Defacto standard, Compilers optimize for the CPU, CPU uarch is now optimizing for compilers |
| |
| ▲ | RiverCrochet 3 hours ago | parent | prev | next [-] | | [dead] | |
| ▲ | idontwantthis 2 hours ago | parent | prev [-] | | The blog post is about why this is idiomatic not whether it needs to be done that way today. It’s idiomatic because once upon a time none of that existed and xor gates did. The author apparently never took intro to digital logic. |
|
|
| ▲ | commandlinefan 3 hours ago | parent | prev | next [-] |
| It's still the same number of clock cycles, though, isn't it? You're using some extra circuitry during the SUB, but during the XOR, that circuitry is just sitting idle anyway, so it's still six of one/half a dozen of the other. |
| |
| ▲ | DSMan195276 3 hours ago | parent | next [-] | | It all depends on the CPU architecture, if it supports something like out-of-order execution then both parts of the CPU could be in use at the same time to execute different instructions. Realistically any CPU with that level of complexity doesn't care about SUB vs XOR though. | | |
| ▲ | monocasa 15 minutes ago | parent [-] | | Also, because SUB is implemented internally with XOR, so it's normally the same gates, with different signals selecting a different function. |
| |
| ▲ | 3 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | 3 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | RiverCrochet 3 hours ago | parent | prev [-] | | XOR can do everything in 1 cycle (which is hopefully far, far less than the clock). SUB-if done the simple way-has to take n cycles where n is the number of bits subtracted. | | |
| ▲ | anvuong 3 hours ago | parent | next [-] | | That's just not true. You think subtracting/adding 64-bit numbers actually take 64 cycles? There is sequential implementation of ripple carry adder that uses clock and register, this will add 1-bit per cycle, but no body uses this for obvious reason, it's just a toy example for education. A normal ripple carry adder will have some delay in propagation time before the output is valid, but that is much less a clock cycle. You can also design a customized adder circuit for 4-bit 8-bit 16-bit etc separately that would greatly minimizes the propagation delay to only 2 or 3 levels of gates, instead of n gates like in the ripple carry adder. | | |
| ▲ | valleyer 3 hours ago | parent [-] | | Right. In other words, the clock cycle is already made to be long enough to allow a word-sized SUB to settle. An XOR-with-self surely settles faster, but it still has to wait for that same clock cycle before proceeding. |
| |
| ▲ | jonathrg 3 hours ago | parent | prev [-] | | What do you mean by cycles? A ripple-carry adder needs to wait for the carry bits to ripple through yes, but there's no clock cycle involved. | | |
|
|
|
| ▲ | monocasa 16 minutes ago | parent | prev | next [-] |
| Except ALUs share hardware with logic functions. Internally the adder (which is also used as a subtractor by ones complementing one of the inputs and inverting the initial carry in) uses xor, and you can implement the XOR logic op with the same gates. Also, modern ALUs don't use ripple carries really any more, but instead stuff like a Kogge-Stone adder (or really, typically a hierarchical set of different techniques). https://en.wikipedia.org/wiki/Kogge%E2%80%93Stone_adder |
|
| ▲ | z500 2 hours ago | parent | prev [-] |
| That was what I guessed too, but according to the article Intel detected both. |