jcranmer 3 days ago

> x86 decoding must be a pain

So one of the projects I've been working on, off and on, is the World's Worst x86 Decoder, which takes a principled approach to x86 decoding by throwing out most of the manual and instead reverse-engineering semantics by running the instructions themselves to figure out what they do. It's still far from finished, but I've gotten it to the point that I can spit out decoder rules.

As a result, I feel pretty confident in saying that x86 decoding isn't that insane. For example, here's the bitset for the first two opcode maps on whether or not opcodes have a ModR/M operand: ModRM=1111000011110000111100001111000011110000111100001111000011110000000000000000000000000000000000000011000001010000000000000000000011111111111111110000000000000000000000000000000000000000000000001100111100000000111100001111111100000000000000000000001100000011111100000000010011111111111111110000000011111111000000000000000011111111111111111111111111111111111111111111111111111110000011110000000000000000111111111111111100011100000111111111011110111111111111110000000011111111111111111111111111111111111111111111111

I haven't done a k-map on that, but... you can see that a boolean circuit isn't that complicated. Also, it turns out that this isn't dependent on presence or absence of any prefixes. While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle, which means the main limitation on the parallelism in the decoder is how wide you can build those muxes (which, to be fair, does have a cost).
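
For illustration, a bitset like that is just a 256-entry lookup table per opcode map; a minimal C sketch (with placeholder table contents, not the real map) might pack it into four 64-bit words:

    #include <stdint.h>

    /* Placeholder contents -- the real bits come from the bitset above. */
    static const uint64_t has_modrm_map0[4] = { 0, 0, 0, 0 };

    /* Returns 1 if the map-0 opcode byte is followed by a ModR/M byte. */
    static int opcode_has_modrm(uint8_t opcode) {
        return (has_modrm_map0[opcode >> 6] >> (opcode & 63)) & 1;
    }

In hardware the same lookup is just a fixed function of the opcode bits, which is what the "boolean circuit" remark is getting at.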

That said, there is one instruction where I want to go back in time and beat up the x86 ISA designers. f6/0, f6/1, f7/0, and f7/1 [1] take an extra immediate operand, whereas f6/2 et al. do not. It's the sole case in the entire ISA where this happens.

[1] My notation for when x86 does its trick of using one of the register selector fields as extra bits for opcodes.
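
To make the f6/f7 wrinkle concrete, here's a minimal sketch (the helper name is made up) of what a length decoder has to do there: peek at the ModR/M reg field before it knows whether an immediate follows.

    #include <stdint.h>

    /* Immediate size for opcodes F6/F7 depends on the ModR/M reg field:
     * /0 and /1 (TEST) carry an immediate, /2../7 do not.
     * opsize_bytes is 2 or 4 depending on the operand-size prefix. */
    static int f6_f7_imm_bytes(uint8_t opcode, uint8_t modrm, int opsize_bytes) {
        int reg = (modrm >> 3) & 7;                  /* the /digit field */
        if (reg > 1)
            return 0;                                /* NOT, NEG, MUL, IMUL, DIV, IDIV */
        return (opcode == 0xF6) ? 1 : opsize_bytes;  /* TEST r/m, imm */
    }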

monocasa 3 days ago | parent | next [-]

> While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle

That's been my understanding as well. x86-style length decoding is about one pipeline stage if done dynamically.

The simpler RISC-V length decoding ends up being about half a pipeline stage on the wider decoders.

Dylan16807 3 days ago | parent | prev | next [-]

> While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle

That's some very faint praise there, especially when you're trying to chop up several instructions every cycle. Meanwhile RISC-V is "count the leading 1s: 0-1 = 16-bit, 2-4 = 32-bit, 5 = 48-bit, 6 = 64-bit".
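
For reference, that rule from the base spec fits in a few lines of C; a sketch that only looks at the low bits of the first 16-bit parcel:

    #include <stdint.h>

    /* Length in bytes of a RISC-V instruction, from its first 16-bit parcel. */
    static int rv_insn_bytes(uint16_t parcel) {
        if ((parcel & 0x03) != 0x03) return 2;  /* 0-1 low 1s: compressed */
        if ((parcel & 0x1c) != 0x1c) return 4;  /* 2-4 low 1s: 32-bit */
        if ((parcel & 0x20) == 0)    return 6;  /* 5 low 1s: 48-bit */
        if ((parcel & 0x40) == 0)    return 8;  /* 6 low 1s: 64-bit */
        return 0;                               /* reserved longer encodings */
    }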

mohinder 3 days ago | parent [-]

The chopping up can happen the next cycle, in parallel across all the instructions in the cache line(s) that were fetched, and it can be pipelined so there's no loss in throughput. Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8). And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too.

codedokode 3 days ago | parent | next [-]

As for program size, I played with small algorithms (like binary search) on godbolt, and my x86 programs were similar in size to RISC-V with compressed instructions. I rarely saw 1-byte instructions; there was almost always at least one prefix.

> e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8

With compressed instructions the theoretical maximum is 16.

> so they have to deal with the problem too.

Luckily you can determine the length from the first bits of an instruction, and you can have either 2 bytes left over from the previous line, or 0.
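
As an illustration (reusing the rv_insn_bytes sketch from upthread and assuming only 16- and 32-bit instructions, so the carry is at most one 16-bit parcel):

    #include <stddef.h>
    #include <stdint.h>

    /* Walk a fetch line of 16-bit parcels; remember at most one leftover
     * parcel (2 bytes) when an instruction straddles the line boundary. */
    static void scan_fetch_line(const uint16_t *parcels, size_t n,
                                uint16_t *carry, int *have_carry) {
        size_t i = 0;
        if (*have_carry) {
            /* the second half of last line's straddler is parcels[0] */
            *have_carry = 0;
            i = 1;
        }
        while (i < n) {
            size_t len = rv_insn_bytes(parcels[i]) / 2;  /* length in parcels */
            if (i + len > n) {             /* instruction straddles the boundary */
                *carry = parcels[i];
                *have_carry = 1;
                break;
            }
            /* ... hand parcels[i .. i+len-1] to the instruction decoder ... */
            i += len;
        }
    }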

Dylan16807 3 days ago | parent | prev [-]

> The chopping up can happen the next cycle

It still causes issues.

> Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8).

RISC-V has better code density. The handful of one-byte instructions doesn't make up for the other, longer instructions.

> And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too.

Now? Have to deal with the problem too?

It feels like you didn't read my previous post. I was explaining how it's much, much simpler to decode the length. And the variable length has been there since the original version.

eigenform 3 days ago | parent | prev | next [-]

Don't know this for certain, but I always assumed that x86 implementations get away with this by predecoding cache lines. If you're going to do prefetching in parallel and decoupled from everything else, you might as well move part of the work there too? (Obviously not without cost - plus, you can identify branches early!)

matja 3 days ago | parent | prev [-]

Missing a 0 at the end