jcranmer 3 days ago
> x86 decoding must be a pain

So one of the projects I've been working on, off and on, is the World's Worst x86 Decoder, which takes a principled approach to x86 decoding by throwing out most of the manual and instead reverse-engineering semantics by running the instructions themselves to figure out what they do. It's still far from finished, but I've gotten it to the point that I can spit out decoder rules.

As a result, I feel pretty confident in saying that x86 decoding isn't that insane. For example, here's the bitset for the first two opcode maps on whether or not opcodes have a ModR/M operand:

ModRM=1111000011110000111100001111000011110000111100001111000011110000000000000000000000000000000000000011000001010000000000000000000011111111111111110000000000000000000000000000000000000000000000001100111100000000111100001111111100000000000000000000001100000011111100000000010011111111111111110000000011111111000000000000000011111111111111111111111111111111111111111111111111111110000011110000000000000000111111111111111100011100000111111111011110111111111111110000000011111111111111111111111111111111111111111111111

I haven't done a k-map on that, but... you can see that a boolean circuit isn't that complicated. Also, it turns out that this isn't dependent on the presence or absence of any prefixes. While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle, which means the main limitation on the parallelism in the decoder is how wide you can build those muxes (which, to be fair, does have a cost).

That said, there is one instruction where I want to go back in time and beat up the x86 ISA designers. f6/0, f6/1, f7/0, and f7/1 [1] take in an extra immediate operand whereas f6/2 et al. do not. It's the sole case in the entire ISA where this happens.

[1] My notation for when x86 does its trick of using one of the register selector fields as extra bits for opcodes.
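To make the bitset lookup and the f6/f7 special case concrete, here is a minimal Python sketch. It is not the decoder described above. The names `MODRM_PREFIX`, `has_modrm`, and `group3_has_immediate` are hypothetical, the string is only the first 128 bits of the quoted bitset (opcodes 0x00-0x7F of the one-byte map), and it assumes the leftmost bit corresponds to opcode 0x00, which lines up with the familiar one-byte encodings.

```python
# First 128 bits of the ModRM bitset quoted in the comment, covering
# opcodes 0x00-0x7F of the one-byte map.
# ASSUMPTION: the leftmost bit corresponds to opcode 0x00.
MODRM_PREFIX = (
    "1111000011110000111100001111000011110000111100001111000011110000"  # 0x00-0x3F
    "00000000000000000000000000000000"                                  # 0x40-0x5F
    "0011000001010000"                                                  # 0x60-0x6F
    "0000000000000000"                                                  # 0x70-0x7F
)


def has_modrm(opcode: int) -> bool:
    """Return True if the one-byte opcode is followed by a ModR/M byte."""
    if not 0 <= opcode < len(MODRM_PREFIX):
        raise ValueError("opcode outside the range covered by this sketch")
    return MODRM_PREFIX[opcode] == "1"


def group3_has_immediate(opcode: int, modrm: int) -> bool:
    """Return True for f6/0, f6/1, f7/0, f7/1 (the forms with an immediate).

    The reg field (bits 5:3 of the ModR/M byte) acts as extra opcode bits.
    """
    reg = (modrm >> 3) & 0b111
    return opcode in (0xF6, 0xF7) and reg in (0, 1)
```

The second function is the awkward part for a length decoder: for these two opcodes the instruction length cannot be known from the opcode byte alone, because it also depends on three bits of the ModR/M byte.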
monocasa 3 days ago
> While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle

That's been my understanding as well. x86-style length decoding is about one pipeline stage if done dynamically. The simpler RISC-V length decoding ends up being about half a pipeline stage on the wider decoders.
Dylan16807 3 days ago
> While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle

That's some very faint praise there. Especially when you're trying to chop up several instructions every cycle.

Meanwhile RISC-V is "count leading 1s. 0-1:16bit 2-4:32bit 5:48bit 6:64bit"
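The RISC-V rule quoted above can be sketched in a few lines of Python. This is an illustration only: the function name `riscv_length_bits` is hypothetical, and it assumes "leading" means counting consecutive 1 bits starting from the least significant bit of the first 16-bit parcel, which is how the base ISA's length encoding is laid out. Encodings with seven or more such bits (the longer and reserved formats) are deliberately not handled.

```python
def riscv_length_bits(first_parcel: int) -> int:
    """Return the instruction length in bits from its first 16-bit parcel.

    Counts consecutive 1 bits starting at bit 0 and applies the rule
    0-1 -> 16, 2-4 -> 32, 5 -> 48, 6 -> 64.
    """
    ones = 0
    while ones < 7 and (first_parcel >> ones) & 1:
        ones += 1
    if ones <= 1:
        return 16
    if ones <= 4:
        return 32
    if ones == 5:
        return 48
    if ones == 6:
        return 64
    # ASSUMPTION: longer/reserved encodings are out of scope for this sketch.
    raise ValueError("encoding longer than 64 bits or reserved")
```

For example, the low halfword of the canonical NOP (`0x0013`) ends in two 1 bits, so it decodes as a 32-bit instruction, while any parcel whose low two bits are not `11` is a 16-bit compressed instruction.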
eigenform 3 days ago
Don't know this for certain, but I always assumed that x86 implementations get away with this by predecoding cachelines. If you're going to do prefetching in parallel and decoupled from everything else, might as well move part of the work there too? (obviously not without cost - plus, you can identify branches early!)
matja 3 days ago
Missing a 0 at the end