The zero register helps RISC-V (and MIPS before it) cut down on both the number of instructions and hardware complexity. You don't need a mov instruction: you just OR with $zero. You don't need a load-immediate instruction: you just ADDI/ORI with $zero. You don't need a neg instruction: you just SUB from $zero. And all your compare-and-branch instructions get a compare-with-zero variant for free. I refuse to say this "zero register" approach is better; it is part of a wider design with many interacting features. But once you have 31 general registers, it's quite cheap to dedicate one more register slot to zero, and it may actually save encoding space elsewhere. (And encoding space is always an issue with fixed-width instructions.)

AArch64 takes the concept further: the same register number sometimes acts as the zero register (when used in ALU instructions) and other times as the stack pointer (when used in memory instructions and a few special stack instructions).
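Concretely, a small sketch in RISC-V assembly (x0 is the hard-wired zero register; each pseudo-instruction on the left is just the zero-register form shown in the comment):

    li   a0, 42          # addi a0, x0, 42    -- load immediate
    mv   a1, a0          # addi a1, a0, 0     -- register move (or: or a1, a0, x0)
    neg  a2, a1          # sub  a2, x0, a1    -- negate
    beqz a0, done        # beq  a0, x0, done  -- compare-and-branch against zero
done: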
▲ | kruador 7 hours ago |

ARM64 also has fixed-length 32-bit instructions. Yes, immediates are normally small, and it's not particularly orthogonal about how many bits are available. The largest MOV available is 16 bits, but those 16 bits can be shifted by 0, 16, 32 or 48 bits, so the worst case for a 64-bit immediate is 4 instructions. Or the compiler can decide to put the data in a PC-relative pool and use ADR or ADRP to calculate the address. ADD immediate is 12 bits but can optionally apply a 12-bit left shift to that immediate, so immediates up to 24 bits can be done in two instructions.

ARM64 decoding is also pretty complex, far less orthogonal than ARM32. Then again, ARM32 was designed to be decodable on a chip with 25,000 transistors, not one where you can spend thousands of transistors decoding a single instruction.
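For example, the worst-case 64-bit constant looks something like this in AArch64 assembly (the value is arbitrary, just to show all four 16-bit chunks):

    movz x0, #0x1111                // x0 = 0x0000000000001111
    movk x0, #0x2222, lsl #16       // x0 = 0x0000000022221111
    movk x0, #0x3333, lsl #32       // x0 = 0x0000333322221111
    movk x0, #0x4444, lsl #48       // x0 = 0x4444333322221111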
▲ | zozbot234 10 hours ago |

There's not a strong case for redoing the RISC-V encoding with a new RISC-VI unless they run out of 32-bit encoding space outright, due to e.g. extensive new vector-like or AI-like instructions. And then they could free up a huge amount of encoding space trivially by moving to a 2-address format throughout, with Rd = Rs1, and using a simple instruction-fusion approach (MOV Rd ← Rs1; OP Rd, ...) for the former 3-address case. (Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.)
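A rough sketch of that idea in RISC-V assembly (the 2-address encoding is hypothetical, just to illustrate the fusion):

    # today's 3-address form:
    add a0, a1, a2          # a0 = a1 + a2

    # hypothetical 2-address encoding (Rd must equal Rs1), with the
    # decoder fusing the mv+add pair back into one 3-address micro-op:
    mv  a0, a1              # a0 = a1
    add a0, a0, a2          # a0 = a0 + a2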
▲ | phkahler 5 hours ago |

>> Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.

I really like the idea of composition or standard prefixes. My favorite is the idea of replacing cmp/branch with an "if" prefix, where the condition is a predicate for the following instruction. For RISC-V it would eat a large part of the 16-bit opcode space. Some form of load/store might be a good use for the remaining 16-bit ops. Other things that might make a good prefix could be encoding data types (8-, 16-, 32-, 64-bit, sign-extended, float, double) or a source/destination register. It might be interesting to see how a full ISA might be decomposed into smaller instruction fragments.
▲ | zozbot234 5 hours ago |

> an "if" prefix, where the condition is a predicate for the following instruction

This is just a forward skip, which is optimized to a predicated insn already in some implementations.
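In today's RISC-V that pattern is just a one-instruction forward branch, roughly:

    # "if (a0 == 0) a1 = a2;" written as a skip over one instruction
    bnez a0, 1f             # condition false? skip the next insn
    mv   a1, a2
1:
    # a front end can recognise this shape and execute the mv as a
    # predicated (conditional-move-like) micro-op instead of predicting
    # the branch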
▲ | adgjlsfhk1 10 hours ago |

IMO the RISC-V decoding is really elegant (arguably excepting the C extension). Things like 64-bit immediates are almost certainly a bad idea (as opposed to just having a load from memory). Most 64-bit constants in use can be sign-extended from much smaller values, and for those that can't, supporting 72-bit (or bigger) instructions just to be able to load a 64-bit immediate will necessarily bloat the instruction cache, stall your instruction decoder (or limit parallelism), and will only be 2 cycles faster than an L1 cache load (if the instruction is hot).

A 32-bit immediate would be kind of nice, but the benefit is pretty small. An x86 instruction with a 32-bit immediate is 6 bytes, while the 2 RISC-V instructions are 8 bytes. There have been proposals to add 48-bit instructions, which would give RISC-V 32-bit immediate support in the same 6 bytes as x86 (and 12-byte two-instruction 64-bit loads vs 10 bytes for x86, in the very rare situations where doing so will be faster than a load).

ISA design is always a tradeoff; https://ics.uci.edu/~swjun/courses/2023F-CS250P/materials/le... has some good details, but the TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.
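For reference, this is roughly what constant materialization looks like in RISC-V assembly today (the values are arbitrary):

    # a 32-bit constant: two 4-byte instructions (8 bytes total)
    lui   a0, 0x12345              # a0 = 0x12345000
    addi  a0, a0, 0x678            # a0 = 0x12345678

    # an arbitrary 64-bit constant: either a longer lui/addi/slli chain
    # (what "li a0, <imm64>" expands to), or a PC-relative load from a
    # nearby constant pool:
1:  auipc a1, %pcrel_hi(.Lconst)
    ld    a1, %pcrel_lo(1b)(a1)    # .Lconst: .dword 0x0123456789abcdef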
▲ | Tuna-Fish 8 hours ago |

> Things like 64-bit immediates are almost certainly a bad idea (as opposed to just having a load from memory)

Strongly disagree. Throughput is cheap, latency is expensive. Any time you can fit a constant into the instruction fetch stream, it is a win. This is especially true for jump targets, because getting them resolved faster both saves power and improves performance.

> Most 64-bit constants in use can be sign-extended from much smaller values

You should obviously also have smaller load-immediate instructions.

> will necessarily bloat the instruction cache, stall your instruction decoder (or limit parallelism)

No, just have more fetch throughput.

> and will only be 2 cycles faster than an L1 cache load

Only on tiny machines will an L1 cache load be 2 cycles. On a reasonable high-end machine it will be 4-5 cycles, and, more critically (because the latency would usually be masked well by OoO), the energy required to engage the load path is orders of magnitude more than just getting the value from fetch. And that's when it's not a jump target; when it is a jump target, loading it with a load instruction suddenly adds 12+ cycles of latency.

> the TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA

No. Not even talking about constants, RISC-V makes insane choices for essentially religious reasons. Can you explain to me why, exactly, you would ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?
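To illustrate the jump-target point, a sketch in RISC-V assembly (func and the function-pointer load are just placeholders):

    # target encoded in the instruction stream (direct call): the front
    # end can verify the predicted target as soon as it decodes this
    jal  ra, func

    # target loaded from memory (e.g. a function pointer): the indirect
    # jump can't be confirmed until the load completes, so the load
    # latency sits on the branch-resolution path
    ld   t0, 0(a0)
    jalr ra, 0(t0)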
▲ | adgjlsfhk1 7 hours ago |

> No, just have more fetch throughput.

Fetch throughput isn't unlimited. Modern x86 CPUs only have ~16-32 bytes/cycle (from L2, once you're out of the uop cache). If you decode a single 10-byte instruction, you're already using up a huge amount of the available decode bandwidth. There absolutely are cases where a 64-bit load-immediate instruction would be an advantage, but ISA design is always a case of tradeoffs. Allowing 10-byte instructions has a real cost in decode complexity, instruction-bandwidth requirements, cacheline/page alignment handling, etc. You have to weigh that against how frequent the instruction would be, as well as what your alternative options are. Most immediates are small, many others can be efficiently synthesized via 2 other instructions (e.g. shifts/xors/nots), and any synthesis that is 2 instructions or fewer will be cheaper than doing a load anyway. As a result you would end up massively complicating your architecture/decoders to benefit a fairly rare instruction, which probably isn't worthwhile. It's notable that AArch64 makes the same tradeoff here, and Apple's M-series processors have an IPC advantage over the best x86.

> Can you explain to me why, exactly, you would ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?

This mostly seems like a mistake to me. The rationale probably is that you need non-linking jumps anyway (not every jump is a call), so jal with an rd field covers both, and adding a separate jal that doesn't take a register would eat a decent percentage of the opcode space. But the extra 5 bits of offset would be very nice.
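For context, the rd field is what lets one jal opcode serve as both a call and a plain jump in RISC-V assembly:

    jal  ra, func        # call: ra = return address, jump to func
    jal  x0, loop        # plain jump ("j loop"): the link value is discarded

    # a hypothetical fixed-link jal could recycle those 5 rd bits as extra
    # offset bits (roughly +/-1 MiB -> +/-32 MiB direct reach), but would
    # then need a separate opcode for jumps that don't write a link register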