- Handling of misaligned loads/stores: RISC-V got itself into a weird middle ground where ops on misaligned pointers may work fine, may work "extremely slow", or may cause fatal exceptions (yes, I know about Zicclsm; it's extremely new and only helps with the latter, also see https://github.com/llvm/llvm-project/issues/110454). Other platforms either guarantee "reasonable" performance for such operations, or forbid misaligned access with "aligned" loads/stores and provide separate misaligned instructions. Arguably, RISC-V should've done the latter (with misaligned instructions defined in a separate higher-end extension), since passing an unaligned pointer into an aligned instruction signals correctness problems in software.

- The hardcoded page size. 4 KiB is a good default for RV32, but arguably a huge missed opportunity for RV64.

- The weird restriction in the forward progress guarantees for LR/SC sequences, which forces compilers to compile `compare_exchange` and `compare_exchange_weak` in exactly the same way. See this issue for more information: https://github.com/riscv/riscv-isa-manual/issues/2047

- The `seed` CSR: it does not provide good-quality entropy (i.e. after you have accumulated 256 bits of output, it may contain only 128 bits of randomness). You have to use a CSPRNG on top of it for any sensitive applications. Doing so may be inefficient and will bloat binary size (remember, the relaxed requirement was introduced for "low-powered" devices). Also, software developers may make mistakes in this area (not everyone is a security expert). Similar alternatives like RDRAND (x86) and RNDR (ARM) guarantee proper randomness and we can use their output directly for cryptographic keys with a very small code footprint.

- Extensions do not form hierarchies: it looks like the AVX-512 situation once again, but worse. Profiles help, but a profile is a "packet" of extensions, not a hierarchy. Also, there are annoyances like Zbkb not being a proper subset of Zbb.

- Detection of available extensions: we usually have to rely on the OS to query available extensions since the `misa` register is accessible only in machine mode. This makes detection quite annoying for "universal" libraries which intend to support various OSes and embedded targets. The CPUID instruction (x86) is ideal in this regard. I totally disagree with the virtualization argument against it: nothing prevents a VM from intercepting the read, and no one expects high performance from such reads.
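
To make the OS dependence in the last point concrete: on Linux, one (Linux-only) route is the `isa` line in `/proc/cpuinfo`, which a library then has to parse itself. Below is a minimal sketch of such a parser. `parse_isa_string` is a hypothetical helper, not an existing API, and it assumes the common `rv64imafdc_zicsr_zbb` style without version suffixes.

```python
def parse_isa_string(isa: str):
    """Split an ISA string such as 'rv64imafdc_zicsr_zbb' into (xlen, extensions).

    Assumption: no version suffixes (e.g. 'i2p1') and multi-letter
    extensions are separated by underscores.
    """
    isa = isa.strip().lower()
    if not isa.startswith("rv"):
        raise ValueError(f"not a RISC-V ISA string: {isa!r}")
    base, *multi_letter = isa[2:].split("_")

    # Leading digits give XLEN (32, 64 or 128).
    digits = ""
    while base and base[0].isdigit():
        digits += base[0]
        base = base[1:]
    if not digits:
        raise ValueError(f"missing XLEN in ISA string: {isa!r}")

    extensions = set(base)  # remaining characters are single-letter extensions
    extensions.update(name for name in multi_letter if name)

    # 'g' is shorthand for IMAFD plus Zicsr and Zifencei.
    if "g" in extensions:
        extensions.update({"i", "m", "a", "f", "d", "zicsr", "zifencei"})
    return int(digits), extensions
```

A caller would feed it the value of the `isa` field and test membership, e.g. `"zbb" in parse_isa_string(line)[1]`. None of this helps on bare metal or on other OSes, which is the complaint.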

And this list was compiled after a pretty surface-level dive into the RISC-V spec. I have heard about other issues (e.g. being unable to port tricky SIMD code to the V extension, or underspecification around memory coherence that matters for writing drivers), but I cannot confidently talk about those, so they're not part of my list.

P.S.: I would be interested to hear about other people's gripes with RISC-V.

ack_complete 3 days ago

> Detection of available extensions: we usually have to rely on the OS to query available extensions since the `misa` register is accessible only in machine mode.

Not a RISC-V programmer, but this drives me crazy on ARM. Dozens of optional features, but the FEAT_ bits are all readable only from EL1, and it's unspecified what API the OS exposes to query them and which feature bits are exposed. I don't care if it'd be slow, just give us the equivalent of a dedicated CPUID instruction, even if it's just a reserved opcode that traps to kernel mode and is handled in software.

	cesarb 3 days ago
		> but the FEAT_ bits are all readable only from EL1, [...] I don't care if it'd be slow, just give us the equivalent of a dedicated CPUID instruction, even if it's just a reserved opcode that traps to kernel mode and is handled in software.

		I like the way the Linux kernel solves this: these FEAT_ bits are also readable from EL0, since trying to read them traps to kernel mode and the read is emulated by the kernel. See https://docs.kernel.org/arch/arm64/cpu-feature-registers.htm... for details. Unfortunately, it's a Linux-only feature, and it didn't exist originally (so old enough Linux kernel versions won't have the emulation).

brandmeyer 3 days ago

Nothing major, just some oddball decisions here and there.

Fused compare-and-branch only extends to the base integer instructions. Anything else needs to generate a value that feeds into a compare-and-branch. Since all branches are compare-and-branch, they all need two register operands, which leaves room for only a 12-bit immediate and limits their reach to a mere +/- 4 KiB.

The reach for position-independent code instructions (AUIPC + any load or store) is not quite +/- 2 GB. There is a hole on either end of the reach that is a consequence of using a sign-extended 12-bit offset for loads and stores, and a sign-extended high 20-bit offset for AUIPC. ARM's adrp (address of page) + unsigned offsets is more uniform.
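
To see where the lopsided reach comes from, here is a small model of how an offset gets split between AUIPC's 20-bit immediate and the 12-bit load/store immediate. It is only a sketch: it assumes RV64 (no 32-bit wraparound), and `split_pcrel` is a made-up name for illustration.

```python
def split_pcrel(offset: int):
    """Return (hi20, lo12) such that (hi20 << 12) + lo12 == offset,
    or None if the pair cannot be encoded. Assumes RV64 address arithmetic."""
    # Because lo12 is sign-extended, hi20 has to be rounded to the nearest
    # 4 KiB multiple rather than truncated. Python's >> floors.
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)  # always in [-2048, 2047]
    if not -(1 << 19) <= hi20 < (1 << 19):
        return None  # does not fit in AUIPC's signed 20-bit immediate
    return hi20, lo12


highest = 2**31 - 2**11 - 1   # last reachable offset above the PC
lowest = -(2**31) - 2**11     # last reachable offset below the PC
print(split_pcrel(highest), split_pcrel(highest + 1))
print(split_pcrel(lowest), split_pcrel(lowest - 1))
```

Under these assumptions the window is shifted by 2 KiB instead of being symmetric around the PC: the topmost 2 KiB below +2 GiB cannot be reached with a single AUIPC pair.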

RV32 isn't a proper subset of RV64, which isn't a proper subset of RV128. If they were proper subsets, then RV64 programs could run unmodified on RV128 hardware. Not that it's ever going to happen, but if it did, the processor would have to mode-switch, not unlike the x86-64 transition of yore.

Floating point arithmetic spends three bits in the instruction encoding to support static rounding modes. I can count on zero hands the number of times I've needed that.

The integer ISA design goes to great lengths to avoid any instructions with three source operands, in order to simplify the datapaths on tiny machines. But... the floating point extension correctly includes fused multiply-add. So big chunks of any high-end processor will need three-operand datapaths anyway.

The base ISA is entirely too basic, and a classic failure of 90% design. Just because most code doesn't need all those other instructions doesn't mean that most systems don't. RISC-V is gathering extensions like a Katamari to fill in all those holes (B, Zfa, etc).

None of those things make it bad, I just don't think it's nearly as shiny as the hype. ARM64+SVE and x86-64+AVX512 are just better.

	adgjlsfhk1 2 days ago
		> Floating point arithmetic spends three bits in the instruction encoding to support static rounding modes.

		IMO this is way better than the alternative in x86 and ARM. The reason no one deals with rounding modes is that changing the mode is really slow and you always need to change it back or else everything breaks. Being able to do it in the instruction allows you to do operations with non-standard modes much more simply. For example, round-to-nearest-ties-to-odd can be incredibly useful to prevent double rounding.
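
For anyone who hasn't run into double rounding: here is a toy integer model of it. It assumes "round to odd" in the usual sticky-bit sense (truncate, then set the lowest kept bit if anything was discarded), which may not be exactly the mode meant above, and it says nothing about which RISC-V instructions offer it. The function names are made up for the example.

```python
def round_nearest_even(x: int, drop: int) -> int:
    """Drop `drop` low bits of a non-negative integer, ties to even."""
    q, r = x >> drop, x & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if r > half or (r == half and q & 1):
        q += 1
    return q


def round_to_odd(x: int, drop: int) -> int:
    """Drop `drop` low bits; if any dropped bit was set, force the LSB to 1."""
    q, r = x >> drop, x & ((1 << drop) - 1)
    return q | int(r != 0)


x = 0b10111  # 23, i.e. 1.4375 in units of 16
direct = round_nearest_even(x, 4)                          # one rounding step
double = round_nearest_even(round_nearest_even(x, 2), 2)   # two nearest-even steps
via_odd = round_nearest_even(round_to_odd(x, 2), 2)        # odd first, then nearest
print(direct, double, via_odd)
```

Rounding to nearest twice turns 1.4375 into 1.5 and then breaks the fake tie upwards, while the round-to-odd intermediate keeps enough information for the final step to agree with the direct rounding.
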
	adgjlsfhk1 2 days ago
		> The base ISA is entirely too basic

		IMO this is very wrong. The base ISA is excellent for microcontrollers and teaching, while the ~90% of real implementations that need more can add the extra 20 extensions to make a modern, fully featured CPU.

dzaima 3 days ago

Another bad choice (perhaps more accurately called a bug, but they chose to not do anything about it): vmv1r.v & co (aka the whole-vector-register move instructions) depend on a valid vtype being set, despite not using any part of it. The only exception is the extreme edge case of an interrupt happening in the middle of the move and the hardware wanting to chop the operation in half instead of finishing it, which is entirely pointless for application-class CPUs where VLEN isn't massive enough for that to be in any way useful (never mind moves being O(1) with register renaming).

So to move one vector register to another, you need to have a preceding vsetvl; worse, with the standard calling convention you may get an illegal vtype after a function call! Even worse, the behavior of a move with an illegal vtype is actually left reserved, so hardware can (and some does) just allow it, thereby making it impossible to even test for on some hardware.

Oh, and that thing about being able to stop a vector instruction midway through? You might think that's to allow guaranteeing fast interrupts while keeping easy forward progress; but no, vector reductions cannot be restarted. And there's the extremely horrific vfredosum[1], which is an ordered float sum reduction, i.e. a linear chain of N float adds, i.e. a (fp add latency) * (element count in vector)-cycle op that must be started completely over again if interrupted.

[1]: https://dzaima.github.io/intrinsics-viewer/#0q1YqVbJSKsosTtY...
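
As a rough illustration of why the ordered reduction above has to be a linear chain: float addition is not associative, so regrouping the adds into a tree can change the answer. This is a plain Python model of the two summation orders, not of the actual instruction semantics, and both function names are made up.

```python
def ordered_sum(xs):
    """Strictly left-to-right sum: N dependent adds, one after another."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def tree_sum(xs):
    """Pairwise sum: about log2(N) levels of independent adds."""
    xs = list(xs)
    while len(xs) > 1:
        pairs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:
            pairs.append(xs[-1])  # odd element carries over to the next level
        xs = pairs
    return xs[0] if xs else 0.0


data = [1e16, 1.0, -1e16, 1.0]
print(ordered_sum(data), tree_sum(data))  # the two orders disagree on this input
```

An unordered reduction is free to pick the tree shape; the ordered one has to pay for every add in sequence.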

mixmastamyk 3 days ago

Sounds like a job for RISC-6, or VI.

adgjlsfhk1 3 days ago

> - The hardcoded page size.

I'm pretty confident that this will get removed. It's an extension that made its way into RVA23, but once anyone has a design big enough for it to be a burden, it can be dropped.

	monocasa 3 days ago
		That's really hard to drop. Fancier Unix programs tend to make all kinds of assumptions about page size to do things like the double-mapped ring buffer trick: https://en.wikipedia.org/wiki/Circular_buffer#Optimization

		In fact it looks like Apple silicon maintains support for 4 KiB pages just for running Rosetta. Like TSO, it's one of those things where working around software assumptions was enough of a pain that they just included hardware support for it, which isn't enabled when running regular ARM software.
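
The double-mapping trick only works if the buffer length is a whole number of pages, since mappings are page-granular. A minimal sketch of the portable way to size such a buffer follows, with `ring_buffer_size` as a made-up helper name. Code that writes 4096 where this uses the queried page size is the kind of assumption that breaks on larger pages.

```python
import mmap


def ring_buffer_size(min_bytes: int, page_size: int = mmap.PAGESIZE) -> int:
    """Round a requested capacity up to a whole number of pages."""
    if min_bytes <= 0 or page_size <= 0:
        raise ValueError("sizes must be positive")
    pages = -(-min_bytes // page_size)  # ceiling division
    return pages * page_size


# The same request needs a different size on 4 KiB and 16 KiB systems.
print(ring_buffer_size(5000, 4096), ring_buffer_size(5000, 16384))
```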

camel-cdr 3 days ago

> Handling of misaligned loads/stores

Agreed, I think the problem is that RVI doesn't want to/can't mandate implementation details.

I hope that the first few RVA23 cores will have proper misaligned load/store support, so that we can tell toolchains that RVA23 or Zicclsm means fast misaligned load/store, and future hardware that is stupid enough to not implement it will just have to suffer.

There is a silver lining, because you can transform N misaligned loads into N+1 aligned ones + a few instructions to stitch together the result. Currently this needs to be done manually, but hopefully it will be an optimization in future compiler versions: https://github.com/llvm/llvm-project/issues/150263 (Edit: oh, I should've recognised your username, xd)
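
The N to N+1 transformation can be sketched roughly like this. It is a byte-level model in Python rather than real RISC-V code: it assumes little-endian 64-bit words, and `load_aligned` / `load_misaligned_run` are hypothetical names standing in for `ld` and the stitched sequence.

```python
WORD = 8                 # bytes per aligned load (think RV64 `ld`)
MASK = (1 << 64) - 1


def load_aligned(mem: bytes, addr: int) -> int:
    """Model of an aligned 64-bit little-endian load."""
    assert addr % WORD == 0, "aligned loads only"
    return int.from_bytes(mem[addr:addr + WORD], "little")


def load_misaligned_run(mem: bytes, addr: int, n: int) -> list:
    """Read n consecutive 64-bit words starting at a possibly misaligned addr."""
    base = addr & ~(WORD - 1)
    shift = (addr - base) * 8
    if shift == 0:
        return [load_aligned(mem, base + i * WORD) for i in range(n)]
    # N + 1 aligned loads cover the N misaligned words.
    words = [load_aligned(mem, base + i * WORD) for i in range(n + 1)]
    # Each result takes its low part from word i and its high part from word i + 1.
    return [((words[i] >> shift) | (words[i + 1] << (64 - shift))) & MASK
            for i in range(n)]
```

Each output word costs two shifts and an OR on top of the aligned loads, and every aligned word gets reused by two neighbouring results, which is where the N+1 comes from.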

> The hardcoded page size.

There is Svnapot, which is supposed to allow other page sizes, but I don't know enough about it to be sure it actually solves the problem properly.

> You have to use a CSPRNG on top of it for any sensitive applications

Shouldn't you have to do that regardless, and also mix in other kinds of state at the OS level?
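
For reference, the conditioning step being discussed could look roughly like this. It is a sketch only: the CSR read is simulated (`read_seed_csr` is a stand-in, there is no such Python API), SHA-256 is just one plausible conditioner, and the 32-sample count follows from the "128 bits of randomness per 256 bits of output" floor quoted above.

```python
import hashlib
import secrets

# OPST status values in bits 31:30 of the value read from `seed`.
BIST, WAIT, ES16, DEAD = 0b00, 0b01, 0b10, 0b11


def read_seed_csr() -> int:
    """Stand-in for a `seed` CSR read, simulated here with OS randomness."""
    return (ES16 << 30) | secrets.randbits(16)


def gather_key(read_seed=read_seed_csr, samples: int = 32) -> bytes:
    """Hash `samples` valid 16-bit outputs into a 256-bit value.

    32 * 16 = 512 raw bits, i.e. at least 256 bits of entropy if the source
    only guarantees 128 bits per 256 bits of output.
    """
    h = hashlib.sha256()
    collected = 0
    while collected < samples:
        value = read_seed()
        status = value >> 30
        if status == DEAD:
            raise RuntimeError("entropy source reported a fatal error")
        if status == ES16:
            h.update((value & 0xFFFF).to_bytes(2, "little"))
            collected += 1
        # BIST and WAIT: nothing valid yet, poll again.
    return h.digest()
```

A real implementation would typically feed this into a DRBG and mix in other state rather than using the digest directly, which is the point of the question above.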

> Extensions do not form hierarchies

The mandatory extensions in the RVA profiles are a hierarchy.

> Detection of available extensions

I think this is being worked on with unified discovery, which should also cover other microarchitectural details.

There also is a neat toolchain solution with: https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/s...

> being unable to port tricky SIMD code to the V extension

Any NEON code is trivially ported to RVV, as is AVX-512 code that doesn't use GFNI, which is pretty much the only extension that doesn't have an RVV equivalent yet (though neither NEON nor SVE has one either).

The complaints come from wanting to take full advantage of the native vector length in VLA code, which can sometimes be tricky, especially in existing projects that are built around the assumption of fixed vector lengths. But you can always fall back to using RVV as a fixed vector length ISA, with a much faster way of querying the vector length than CPUID.

> P.S.: I would be interested to hear about other people gripes with RISC-V

I feel like encoding scalar fmacc with three sources, separate destinations, and rounding modes was a huge waste of encoding space. I would trade that for a vpternlog equivalent, which is also an encoding hog, any day.

The vl=0 special case was a bad idea: now you have to know/predict vl!=0 to get rid of the vector destination as a read dependency, or have some mechanism to kill an instruction if vl=0.

There should've been restricted vrgather variants earlier, but I'm now (slowly) working on proposing them and a handful of other new vector instructions (mask add/sub, pext/pdep, bmatflip).

Overall though, I think RVV came out surprisingly well; everything works together very nicely.

	zozbot234 3 days ago
		> I feel like encoding scalar fmacc with three sources and separate destinations and rounding modes was a huge waste of encoding space

		This might be easily solved by defining new lighter varieties of the F/D/Q extensions (under new "Z" names) that just don't include the fmacc insn blocks and reserve them for extension. (Of course, these new extensions would themselves be incompatible with the full F/D/Q extensions, effectively deprecating them for general use and relegating them to special-purpose uses where the FMACC encodings are genuinely useful.) Something to think about if the 32-bit insn encoding space becomes excessively scarce.