| |
| ▲ | kruador 8 hours ago | parent | next [-] | | ARM64 assembly has a MOV instruction, but for most of the ways it's used, it's an alias in the assembler to something else. For example, MOV between two registers actually generates ORR rd, rZR, rm, i.e. rd := (zero-register) OR rm. Or, a MOV with a small immediate is ORR rd, rZR, #imm. If trying to set the stack pointer, or copy the stack pointer, instead the underlying instruction is ADD SP, Xn, #0 i.e. SP = Xn + 0. This is because the stack pointer and zero register are both encoded as register 31 (11111). Some instructions allow you to use the zero register, others the stack pointer. Presumably ORR uses the zero register and ADD the stack pointer. NOP maps to HINT #0. There are 128 HINT values available; anything not implemented on this processor executes as a NOP. There are other operations that are aliased like CMP Xm, Xn is really an alias for SUBS XZR, Xm, Xn: subtract Xn from Xm, store the result in the zero register [i.e. discard it], and set the flags. RISC-V doesn't have flags, of course. ARM Ltd clearly considered them still useful. There are other oddities, things like 'rotate right' is encoded as 'extract register from pair of registers', but it specifies the same source register twice. Disassemblers do their best to hide this from you. ARM list a 'preferred decoding' for any instruction that has aliases, to map back to a more meaningful alias wherever possible. | |
| ▲ | Findecanor 8 hours ago | parent | prev | next [-] | | There is a `c.mv` instruction in the compressed set, which most RISC-V processors implement. That, and `add rd, rs, x0` could (like the zeroing idiom on x86), run entirely in the decoding and register-renaming stages of a processor. RISC-V does actually have quite a few idioms. Some idioms are multi-instruction sequences ("macro ops") that could get folded into single micro-ops ("macro-op fusion"/"instruction fusion"): for example `lui` followed by `addi` for loading a 32-bit constant, and left shift followed by right shift for extracting a bitfield. | |
| ▲ | dist1ll 11 hours ago | parent | prev [-] | | Another one is "jalr x0, imm(x0)", which turns an indirect branch into a direct jump to address "imm" in a single instruction w/o clobbering a register. Pretty neat. |
|
| |
| ▲ | phire 11 hours ago | parent | next [-] | | The zero register helps RISC-V (and MIPS before it) really cut down on the number of instructions, and hardware complexity. You don't need a mov instruction, you just OR with $zero. You don't need a load immediate instruction you just ADDI/ORI with $zero. You don't need a Neg instruction, you just SUB with $zero. All your Compare-And-Branch instructions get a compare with $zero variant for free. I refuse to say this "zero register" approach is better, it is part of a wide design with many interacting features. But once you have 31 registers, it's quite cheap to allocate one register to be zero, and may actually save encoding space elsewhere. (And encoding space is always an issue with fixed width instructions). AArch64 takes the concept further, they have a register that is sometimes acts as the zero register (when used in ALU instructions) and other times is the stack pointer (when used in memory instructions and a few special stack instructions). | | |
| ▲ | phkahler 10 hours ago | parent [-] | | >> The zero register helps RISC-V (and MIPS before it) really cut down on the number of instructions, and hardware complexity. Which if funny because IMHO RISC-V instruction encoding is garbage. It was all optimized around the idea of fixed length 32-bit instructions. This leads to weird sized immediates (12 bits?) and 2 instructions to load a 32 bit constant. No support for 64 bit immediates. Then they decided to have "compressed" instructions that are 16 bits, so it's somewhat variable length anyway. IMHO once all the vector, AI and graphics instructions are nailed down they should make RISC-VI where it's almost the same but re-encoding the instructions. Have sensible 16-bit ones, 32-bit, and use immediate constants after the opcodes. It seems like there is a lot they could do to clean it up - obviously not as much as x86 ;-) | | |
| ▲ | kruador 7 hours ago | parent | next [-] | | ARM64 also has fixed length 32-bit instructions. Yes, immediates are normally small and it's not particularly orthogonal as to how many bits are available. The largest MOV available is 16 bits, but those 16 bits can be shifted by 0, 16, 32 or 48 bits, so the worst case for a 64-bit immediate is 4 instructions. Or the compiler can decide to put the data in a PC-relative pool and use ADR or ADRP to calculate the address. ADD immediate is 12 bits but can optionally apply a 12-bit left-shift to that immediate, so for immediates up to 24 bits it can be done in two instructions. ARM64 decoding is also pretty complex, far less orthogonal than ARM32. Then again, ARM32 was designed to be decodable on a chip with 25,000 transistors, not where you can spend thousands of transistors to decode a single instruction. | |
| ▲ | zozbot234 10 hours ago | parent | prev | next [-] | | There's not a strong case for redoing the RISC-V encoding with a new RISC-VI unless they run out of 32-bit encoding space outright, due to e.g. extensive new vector-like or AI-like instructions. And then they could free up a huge amount of encoding space trivially by moving to a 2-address format throughout with Rd=Rs1 and using a simple instruction fusion approach MOV Rd ← Rs1; OP Rd ← etc. for the former 3-address case. (Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach.) | | |
| ▲ | phkahler 5 hours ago | parent [-] | | >> Any instruction that can be similarly rephrased as a composition of more restricted elementary instructions is also a candidate for this macro-insn approach. I really like the idea of composition or standard prefixes. My favorite is the idea of replacing cmp/branch with "if". Where the condition is a predicate for the following instruction. For RISC-V it would eat a large part of the 16bit opcodes. Some form of load/store might be a good use for the remaining 16bit ops. Other things that might be a good prefix could be encoding data types (8,16,32,64 bit, sign extended, float, double) or a source/destination register. It might be interesting to see how a full ISA might be decomposed into smaller instruction fragments. | | |
| ▲ | zozbot234 5 hours ago | parent [-] | | > "if". Where the condition is a predicate for the following instruction This is just a forward skip, which is optimized to a predicated insn already in some implementations. |
|
| |
| ▲ | adgjlsfhk1 10 hours ago | parent | prev [-] | | IMO the riscv decoding is really elegant (arguably excepting the C extension). Things like 64 bit immediates are almost certainly a bad idea (as opposed to just having a load from memory). Most 64 bit constants in use can be sign extended from much smaller values, and for those that can't, supporting 72 bit (or bigger) instructions just to be able to load a 64 bit immediate will necessarily bloat instruction cache, stall your instruction decoder (or limit parallelism), and will only be 2 cycles faster than a L1 cache load (if the instruction is hot). 32 bit immediate would be kind of nice, but the benefit is pretty small. An x86 instruction with 32 bit immediate is 6 bytes, while the 2 RISC-V instructions are 8 bytes. There have been proposals to add 48 bit instructions, which would let Risc-v have 32 bit immediate support with the same 6 bytes as x86 (and 12 byte 2 instructions 64 bit loads vs 10 bit for x86 in the very rare situations where doing so will be faster than a load). ISA design is always a tradeoff, https://ics.uci.edu/~swjun/courses/2023F-CS250P/materials/le... has some good details, but the TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA. | | |
| ▲ | Tuna-Fish 8 hours ago | parent [-] | | > Things like 64 bit immediates are almost certainly a bad idea (as opposed to just having a load from memory) Strongly disagree. Throughput is cheap, latency is expensive. Any time you can fit a constant in the instruction fetch stream is a win. This is especially true for jump targets, because getting them resolved faster both saves power and improves performance. > Most 64 bit constants in use can be sign extended from much smaller values You should obviously also have smaller load instructions. > will necessarily bloat instruction cache, stall your instruction decoder (or limit parallelism) No, just have more fetch throughput. > and will only be 2 cycles faster than a L1 cache load Only on tiny machines will L1 cache load be 2 cycles. On a reasonable high-end machine it will be 4-5 cycles, and more critically (because the latency would usually be masked well by OoO), the energy required to engage the load path is orders of magnitude more than just getting it from the fetch. And that's when it's not a jump target, when it's a jump target suddenly loading it using a load instruction adds 12+ cycles of latency. > TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA. No. Not even talking about constants, RISC-V makes insane choices for essentially religious reasons. Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate? | | |
| ▲ | adgjlsfhk1 7 hours ago | parent [-] | | > No, just have more fetch throughput. Fetch throughput isn't unlimited. Modern x86 CPUs only have ~16-32B/cycle (from L2 once you're out of the uop cache). If you decode a single 10 byte instruction you're already using up a huge amount of the available decode bandwidth. There absolutely are cases where a 64 bit load instruction would be an advantage, but ISA design is always a case of tradeoffs. Allowing 10 byte instructions has real cost in decode complexity, instruction bandwidth requirements, ensuring cacheline/pageline alignment etc. You have to weigh against that how frequent the instruction would be as well as what your alternative options are. Most imediates are small, and many others can be efficiently synthesized via 2 other instructions (e.g. shifts/xors/nots) and any synthesis that is 2 instructions or fewer will be cheaper than doing a load anyway. As a result you would end up massively complicating your architecture/decoders to benefit a fairly rare instruction which probably isn't worthwhile. It's notable that aarch64 makes the same tradeoff here and Apple's M series processors have an IPC advantage over the best x86. > Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate? This mostly seems like a mistake to me. The rational probably is that you need the other instructions anyway (not all jumps are returns), so adding a jal that doesn't take a register would take a decent percentage of the opspace, but the extra 5 bits would be very nice. |
|
|
|
| |
| ▲ | wongarsu 10 hours ago | parent | prev | next [-] | | MIPS for example also has one, along with a similar number of registers (~32). So it's not like RISC-V took a radical new position here, they were able to look back at what worked and what didn't, and decided that for their target a zero register was the right tradeoff. It's certainly the more "elegant" solution. A zero register is useful as input or output register for all kinds of operations, not just for zeroing | |
| ▲ | kevin_thibedeau 11 hours ago | parent | prev | next [-] | | It's a definite liability on a machine with only 8 general purpose registers. Losing 12% of the register space for a constant would be a waste of hardware. | | |
| ▲ | menaerus 11 hours ago | parent [-] | | 8 registers? Ever heard of register renaming? | | |
| ▲ | Polizeiposaune 10 hours ago | parent | next [-] | | Ever heard of a loop that needed to keep more than 7 variables live? Register renaming helps with pipelining and out-of-order execution, but instructions in the program can only reference the architectural registers - go beyond that and you end up needing to spill some values to (architectural) memory. There's a reason why AMD added r8-r15 to the architecture, and why intel is adding r16-r31.. | | |
| ▲ | menaerus 10 hours ago | parent [-] | | I have but that was not the point? My first point was exactly that there are more ISA registers and not only 8, and therefore the question mark. My second point was about register renaming which, contrary what you say, does mitigate the artifacts of running out of registers by spilling the variables to the stack memory. It does it by eliminating the false dependencies between variables/registers and xor eax, eax is a great candidate for that. | | |
| ▲ | saagarjha 8 hours ago | parent [-] | | Register renaming does not let you avoid spills. | | |
| ▲ | menaerus 6 hours ago | parent [-] | | Ok, it obviously doesn't increase the number of ISA registers. What I am suggesting is something else - imagine a situation in which the compiler understands that the spill over will take place, and therefore rearranges the code such that it reduces the pressure on the registers. It can do that if it can break the data dependencies between the variables for instance. Or it can do that by unrolling the loops or by moving the initialization closer to where the variable is being used, no? I am pretty certain that compilers are already doing these kind of transformations, and in a sense this is taking advantage of register renaming but indirectly. |
|
|
| |
| ▲ | account42 10 hours ago | parent | prev | next [-] | | That's irrelevant, the zero register would be taking a slot in the limited register addressing bits in instructions, not replace a physical register on the chip. | |
| ▲ | kevin_thibedeau 8 hours ago | parent | prev [-] | | 8086 doesn't have that. |
|
| |
| ▲ | gruez 11 hours ago | parent | prev | next [-] | | It's basically the eternal debate of RISC vs CISC (x86). RISC proponents claim RISC is better because it's simpler to decode. CISC proponents retort that CISC means code can be more compact, which helps with cache hits. | | |
| ▲ | bluGill 11 hours ago | parent | next [-] | | In the real world there is no CISC or RISC anymore. RISC is always extended to some new feature and suddenly becomes more complex. Meanwhile CISC is just a decoder over a RISC processor. Either way you get the best of both worlds: simple hardware (the RISC internals and CSIC instructions that do what you need. Don't get too carried away in the above, x86 is still a lot more complex than ARM or RISC-V. However the complexity is only a tiny part of a CPU and so it doesn't matter. | | |
| ▲ | snvzz 2 hours ago | parent [-] | | You seem to be confusing ISA and microarchitecture. Modern ISAs try really hard to be independent from microarchitecture. |
| |
| ▲ | snvzz 2 hours ago | parent | prev [-] | | >CISC proponents retort that CISC means code can be more compact, RISC-V has the most compact code on 64bit, with margin to boot. On 32bit, it used to be behind Thumb2, but it's the best as of the bit manipulation and extra compressed extensions circa 2021. |
| |
| ▲ | dooglius 11 hours ago | parent | prev [-] | | I think one could just pick a convention where a particular GP register is zeroed at program startup and just make your own zero register that way, getting all the benefits at very small cost. The microarchitecture AIUI has a dedicated zero register so any processor-level optimizations would still apply. | | |
| ▲ | pklausler 11 hours ago | parent [-] | | That’s what was done on the CDC 6600 with two handy values, B0 (0) and B1 (1). |
|
|