Remix.run Logo
adrian_b 8 hours ago

Nope.

In any ALU the speed is determined by the slowest operation, so XOR is never faster. It does not matter which is the width of the ALU, all that matters is that an ALU does many kinds of operations, including XOR and subtraction, where the operation done by an ALU is selected by some control bits.

I have explained in another comment that the only CPUs where XOR can be faster than subtraction are the so-called superpipelined CPUs. Superpipelined CPUs have been made only after 1990 and there were very few such CPUs. Even if in superpipelined CPUs it is possible for XOR to be faster than subtraction, it is very unlikely that this feature has been implemented in anyone of the few superpipelined CPU models that have ever been made, because it would not have been worthwhile.

For general-purpose computers, there have never been "4-bit ALU times".

The first monolithic general-purpose processor was Intel 8008 (i.e. the monolithic version of Datapoint 2200), with an 8-bit ISA.

Intel claims that Intel 4004 was the first "microprocessor" (in order to move its priority earlier by one year), but that was not a processor for a general-purpose computer, but a calculator IC. Its only historical relevance for the history of personal computers is that the Intel team which designed 4004 gained a lot of experience with it and they established a logic design methodology with PMOS transistors, which they used for designing the Intel 8008 processor.

Intel 4004, its successors and similar 4-bit processors introduced later by Rockwell, TI and others, were suitable only for calculators or for industrial controllers, never for general-purpose computers.

The first computers with monolithic processors, a.k.a. microcomputers, used 8-bit processors, and then 16-bit processors, and so on.

For cost reduction, it is possible for an 8-bit ISA to use a 4-bit ALU or even just a serial 1-bit ALU, but this is transparent for the programmer and for general-purpose computers there never were 4-bit instruction sets.

deathanatos 3 hours ago | parent [-]

> In any ALU the speed is determined by the slowest operation, so XOR is never faster.

On a 386, a reg/reg ADD is 2 cycles. An r32 IMUL is "9-38" cycles.

If what you stated were true, you'd be locking XOR's speed to that of DIV. (Or you do not consider MUL/DIV "arithmetic", or something.)

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/op...

> I have explained in another comment that the only CPUs where XOR can be faster than subtraction are the so-called superpipelined CPUs. Superpipelined CPUs have been made only after 1990 and there were very few such CPUs.

(And I'm choosing 386 to avoid it being "a superpipelined CPU".)

hcs 3 hours ago | parent | next [-]

> Or you do not consider MUL/DIV "arithmetic", or something.

Multiplier and divider are usually not considered part of the ALU, yes. Not uncommon for those to be shared between execution threads while there's an ALU for each.

adrian_b an hour ago | parent | prev [-]

386 is a microprogrammed CPU where a multiplication is dome by a long sequence of microinstructions, including a loop that is executed a variable number of times, hence its long and variable execution time.

A register-register operation required 2 microinstructions, presumably for an ALU operation and for writing back into the register file.

Unlike the later 80486 which had execution pipelines that allowed consecutive ALU operations to be executed back-to-back, so the throughput was 1 ALU operation per clock cycle, in 80386 there was only some pipelining of the overall instruction execution, i.e. instruction fetching and decoding was overlapped with microinstruction execution, but there was no pipelining at a lower level, so it was not possible to execute ALU operations back to back. The fastest instructions required 2 clock cycles and most instructions required more clock cycles.

In 80386, the ALU itself required the same 1 clock cycle for executing either XOR or SUB, but in order to complete 1 instruction the minimum time was 2 clock cycles.

Moreover, this time of 2 clock cycles was optimistic, it assumed that the processor had succeeded to fetch and decode the instruction before the previous instruction was completed. This was not always true, so a XOR or a SUB could randomly require more than 2 clock cycles, when it needed to finish instruction decoding or fetching before doing the ALU operation.

In very old or very cheap processors there are no dedicated multipliers and dividers, so a multiplication or division is done by a sequence of ALU operations. In any high performance processor, multiplications are done by dedicated multipliers and there are also dedicated division/square root devices with their own sequencers. The dividers may share some circuits with the multipliers, or not. When the dividers share some circuits with the multipliers, divisions and multiplications cannot be done concurrently.

In many CPUs, the dedicated multipliers may share some surrounding circuits with an ALU, i.e. they may be connected to the same buses and they may be fed by the same scheduler port, so while a multiplication is executed the associated ALU cannot be used. Nevertheless the core multiplier and ALU remain distinct, because a multiplier and an ALU have very distinct structures. An ALU is built around an adder by adding a lot of control gates that allow the execution of related arithmetic operations, e.g. subtraction/comparison/increment/decrement and of bitwise operations. In cheaper CPUs the ALU can also do shifts and rotations, while in more performant CPUs there may be a dedicated shifter separated from the ALU.

The term ALU can be used with 2 different senses. The strict sense is that an ALU is a digital adder augmented with control gates that allow the selection of any operation from a small set, typically of 8 or 16 or 32 operations, which are simple arithmetic or bitwise operations. Before the monolithic processors, computers were made using separate ALU circuits, like TI SN74181+SN74182 or circuits combining an ALU with registers, e.g. AMD 2901/2903.

In the wide sense, ALU may be used to designate an execution unit of a processor, which may include many subunits, which may be ALUs in the strict sense, shifters, multipliers, dividers, shufflers etc.

An ALU in the strict sense is the minimal kind of execution unit required by a processor. The modern high-performance processors have much more complex execution units.