| ▲ | codedokode 3 days ago |
| x86 decoding must be a pain - I vaguely remember that they have trace caches (a cache of decoded micro-operations) to skip decoding in some cases. You probably don't build such caches when decoding is easy. Also, more complicated decoding and extra caches mean a longer pipeline, which means a higher price to pay when a branch is mispredicted (binary search is a festival of branch mispredictions, for example, and I got a 3x speedup of linear search on small arrays when I switched to a branchless algorithm). Also, I am not a CPU designer, but branch prediction with a wide decoder must be a pain too - imagine that while you are loading 16 or 32 bytes from the instruction cache, you need to predict the address of the next chunk in the same cycle, before you even see what you got from the cache. As for encoding efficiency, I played with little algorithms (like binary search or a slab allocator) on godbolt, and RISC-V with compressed instructions generates a similar amount of code to x86 - in rare cases, even slightly less. So x86 has complex decoding that doesn't give any noticeable advantage. x86 also has flags, which add implicit dependencies between instructions and must make the designer's life harder. |
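To illustrate the branchless-search idea mentioned above, here is a minimal sketch (not the commenter's actual code; the lower_bound_branchless name is made up for illustration, and it assumes a small sorted array):

    #include <stddef.h>

    /* Branchless "lower bound" over a small sorted array: instead of an
     * early-exit comparison branch, count how many elements are below the
     * key, so the loop body has no data-dependent branch to mispredict. */
    static size_t lower_bound_branchless(const int *a, size_t n, int key)
    {
        size_t idx = 0;
        for (size_t i = 0; i < n; i++)
            idx += (size_t)(a[i] < key); /* compare folded into arithmetic, no branch */
        return idx;                      /* index of the first element >= key */
    }

The loop trip count depends only on n, which the predictor handles easily; the data-dependent decision is turned into arithmetic.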
|
| ▲ | wallopinski 3 days ago | parent | next [-] |
| I was an instruction fetch unit (IFU) architect on P6 from 1992-1995. And yes, it was a pain, and we had close to 100x the test vectors of all the other units, going back to the mid 1980s. Once we started going bonkers with the prefixes, we just left the pre-Pentium decoder alone and added new functional blocks to handle those. And it wasn't just branch prediction that sucked, like you called out! Filling the instruction cache was a nightmare: keeping track of head and tail markers, coalescing, rebuilding, ... Lots of parallel decoding, to deal with cache and branch-prediction improvements and to meet timing as the P6 core evolved, was the typical solution. We were the only block (well, minus IO) that had to deal with legacy compatibility. Fortunately I moved on after the launch of Pentium II and thankfully did not have to deal with Pentium4/Northwood. |
| |
| ▲ | nerpderp82 3 days ago | parent [-] | | https://en.wikipedia.org/wiki/P6_(microarchitecture) The P6 is arguably the most important x86 microarch ever; it put Intel on top of the RISC workstations. What was your favorite subsystem in the P6 arch? Was it designed in Verilog? What languages and tools were used to design P6 and the PPro? | | |
| ▲ | wallopinski 3 days ago | parent [-] | | Well duh, the IFU. :) No, I was fond of the FPU because the math was just so bonkers. The way division was performed with complete disregard for the rules taught to gradeschoolers always fascinated me. Bob Colwell told us that P6 was the last architecture one person could understand completely. Tooling & Languages: IHDL, a templating layer on top of HDL that had a preprocessor for Intel-specific macros. DART test template generator for validation coverage vectors. The entire system was stitched together with Perl, Tcl, and shell scripts, and it all ran on three OSes: AIX, HPUX and SunOS. (I had a B&W SPARCstation and was jealous of the 8514/a 1024x768 monitors on AIX.) We didn't go full Linux until Itanic, and by then we were using remote computing via Exceed and gave up our workstations for generic PCs. When I left in the mid 2000s, not much had changed in the glue/automation languages, except a little less Tcl. I'm blanking on the specific formal verification tool; I think it was something by Cadence. Synthesis and timing was ... Design Compiler and PrimeTime? Man. Cobwebs. When I left we were 100% Cadence and Synopsys and Verilog (minus a few custom analog tools based on SPICE for creating our SSAs). That migration happened during Bonnell, but gahd it was painful. Especially migrating all the Pentium/486/386/286/8088 test vectors. I have no idea what it is like ~20 years later (gasp), but I bet the test vectors live on, like Henrietta Lacks' cells. I'd be interested to hear from any Intel folk reading this. |
|
|
|
| ▲ | jcranmer 3 days ago | parent | prev | next [-] |
| > x86 decoding must be a pain So one of the projects I've been working on and off again is the World's Worst x86 Decoder, which takes a principled approach to x86 decoding by throwing out most of the manual and instead reverse-engineering semantics based on running the instructions themselves to figure out what they do. It's still far from finished, but I've gotten it to the point that I can spit out decoder rules. As a result, I feel pretty confident in saying that x86 decoding isn't that insane. For example, here's the bitset for the first two opcode maps on whether or not opcodes have a ModR/M operand: ModRM=1111000011110000111100001111000011110000111100001111000011110000000000000000000000000000000000000011000001010000000000000000000011111111111111110000000000000000000000000000000000000000000000001100111100000000111100001111111100000000000000000000001100000011111100000000010011111111111111110000000011111111000000000000000011111111111111111111111111111111111111111111111111111110000011110000000000000000111111111111111100011100000111111111011110111111111111110000000011111111111111111111111111111111111111111111111 I haven't done a k-map on that, but... you can see that a boolean circuit isn't that complicated. Also, it turns out that this isn't dependent on presence or absence of any prefixes. While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle, which means the main limitation on the parallelism in the decoder is how wide you can build those muxes (which, to be fair, does have a cost). That said, there is one instruction where I want to go back in time and beat up the x86 ISA designers. f6/0, f6/1, f7/0, and f7/1 [1] take in an extra immediate operand whereas f6/2 and et al do not. It's the sole case in the entire ISA where this happens. [1] My notation for when x86 does its trick of using one of the register selector fields as extra bits for opcodes. |
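As a rough sketch of how a derived table like that could be consumed (the table values below are placeholders, not the actual bitset above, and the names has_modrm/opcode_has_modrm are made up for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    /* Placeholder table: 256 bits per opcode map, meant to be filled in from a
     * derived "opcode has ModR/M" bitset like the one quoted above. */
    static const uint64_t has_modrm[2][4] = {
        { 0, 0, 0, 0 },  /* map 0: one-byte opcodes (placeholder values)   */
        { 0, 0, 0, 0 },  /* map 1: 0F-escaped opcodes (placeholder values) */
    };

    /* One step of length decoding: does this opcode carry a ModR/M byte? */
    static bool opcode_has_modrm(unsigned map, uint8_t opcode)
    {
        return (has_modrm[map][opcode >> 6] >> (opcode & 63)) & 1;
    }

In hardware the same check is a 256-entry, 1-bit-wide lookup (or the equivalent boolean circuit), which is part of why length decoding could plausibly fit in a cycle.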
| |
| ▲ | monocasa 3 days ago | parent | next [-] | | > While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle That's been my understanding as well. x86-style length decoding is about one pipeline stage if done dynamically. The simpler RISC-V length decoding ends up being about half a pipeline stage on the wider decoders. | |
| ▲ | Dylan16807 3 days ago | parent | prev | next [-] | | > While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle That's some very faint praise there. Especially when you're trying to chop up several instructions every cycle. Meanwhile RISC-V is "count the leading (low-order) 1s: 0-1: 16-bit, 2-4: 32-bit, 5: 48-bit, 6: 64-bit" | | |
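A small sketch of that rule (counting the consecutive low-order 1 bits of the first 16-bit parcel; it covers only the currently defined 16/32/48/64-bit formats, and the function name is illustrative):

    #include <stdint.h>

    /* Returns the RISC-V instruction length in bytes, or 0 for the reserved
     * (>= 80-bit) encodings, based only on the first 16-bit parcel. */
    static int rv_insn_length(uint16_t first_parcel)
    {
        int ones = 0;
        while (ones < 7 && ((first_parcel >> ones) & 1))
            ones++;

        if (ones <= 1) return 2;  /* compressed (RVC)  */
        if (ones <= 4) return 4;  /* standard 32-bit   */
        if (ones == 5) return 6;  /* 48-bit            */
        if (ones == 6) return 8;  /* 64-bit            */
        return 0;                 /* reserved longer encodings */
    }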
| ▲ | mohinder 3 days ago | parent [-] | | The chopping up can happen the next cycle, in parallel across all the instructions in the cache line(s) that were fetched, and it can be pipelined so there's no loss in throughput. Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8). And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too. | | |
| ▲ | codedokode 3 days ago | parent | next [-] | | As for program size, I played with small algorithms (like binary search) on godbolt, and my x86 programs had a similar size to RISC-V with compressed instructions. I rarely saw 1-byte instructions; there was almost always at least one prefix. > e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8 With compressed instructions the theoretical maximum is 16. > so they have to deal with the problem too. Luckily you can determine the length from the first bits of an instruction, and you can have either 2 bytes left over from the previous line, or 0. | |
| ▲ | Dylan16807 3 days ago | parent | prev [-] | | > The chopping up can happen the next cycle It still causes issues. > Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8). RISC-V has better code density. The handful of one-byte instructions don't make up for the other, longer instructions. > And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too. Now? Have to deal with the problem too? It feels like you didn't read my previous post. I was explaining how it's much, much simpler to decode length. And the variable length has been there since the original version. |
|
| |
| ▲ | eigenform 3 days ago | parent | prev | next [-] | | Don't know this for certain, but I always assumed that x86 implementations get away with this by predecoding cachelines. If you're going to do prefetching in parallel and decoupled from everything else, might as well move part of the work there too? (obviously not without cost - plus, you can identify branches early!) | |
| ▲ | matja 3 days ago | parent | prev [-] | | Missing a 0 at the end |
|
|
| ▲ | monocasa 3 days ago | parent | prev | next [-] |
| > x86 decoding must be a pain - I vaguely remember that they have trace caches (a cache of decoded micro-operations) to skip decoding in some cases. You probably don't build such caches when decoding is easy. To be fair, a lot of modern ARM cores also have uop caches. There's a lot to decode even without the variable-length component, to the point that keeping a cache of uops and temporarily turning pieces of the IFU off can be a win. |
|
| ▲ | jabl 3 days ago | parent | prev | next [-] |
| > I vaguely remember that they have trace caches (a cache of decoded micro-operations) to skip decoding in some cases. You probably don't build such caches when decoding is easy. The P4 microarch had trace caches, but I believe that approach has since been avoided. What practically all contemporary x86 processors do have, though, is u-op caches, which contain decoded micro-ops. Note this is not the same as a trace cache. For that matter, many ARM cores also have u-op caches, so it's not something that is uniquely useful on x86. The Apple M* cores AFAIU do not have u-op caches, FWIW. |
|
| ▲ | camel-cdr 3 days ago | parent | prev | next [-] |
| > trace caches They don't anymore; they have uop caches. But trace caches are great, and Apple uses them [1]. They allow you to collapse taken branches into a single fetch, which is extremely important, because the average instructions/taken-branch is about 10-15 [2]. With a 10-wide frontend, every second fetch would only be half utilized or worse. > extra caches This is one thing I don't understand: why not replace the L1I with the uop cache entirely? I quite like what Ventana does with the Veyron V2/V3. [3,4]
They replaced the L1I with a macro-op trace cache, which can collapse taken branches, do basic instruction fusion, and more advanced fusion for hot code paths. [1] https://www.realworldtech.com/forum/?threadid=223220 [2] https://lists.riscv.org/g/tech-profiles/attachment/353/0/RIS... (page 10) [3] https://www.ventanamicro.com/technology/risc-v-cpu-ip/ [4] https://youtu.be/EWgOVIvsZt8 |
| |
| ▲ | adgjlsfhk1 3 days ago | parent [-] | | You need both. Branches don't tell you "jump to this micro-op"; they say "jump to this address", so you still need the addressing of a normal L1i. |
|
|
| ▲ | phire 3 days ago | parent | prev | next [-] |
| Intel’s E cores decode x86 without a trace cache (μop cache), and are very efficient. The latest (Skymont) can decode 9 x86 instructions per cycle, more than the P core (which can only decode 8). AMD isn’t saying that decoding x86 is easy. They are just saying that decoding x86 doesn’t have a notable power impact. |
| |
| ▲ | varispeed 3 days ago | parent [-] | | Does that really say anything about efficiency? Why can't they decode 100 instructions per cycle? | | |
| ▲ | ajross 3 days ago | parent | next [-] | | > Why can't they decode 100 instructions per cycle? Well, obviously because there aren't 100 individual parallel execution units to which those instructions could be issued. And lower down the stack because a 3000-bit[1]-wide cache would be extremely difficult to manage. An instruction fetch would be six (!) cache lines wide, causing clear latency and bottleneck problems (or conversely would demand your icache be 6x wider, causing locality/granularity problems as many leaf functions are smaller than that). But also because real-world code just isn't that parallel. Even assuming perfect branch prediction, the number of instructions between unpredictable things like function pointer calls or computed jumps is much less than 100 in most performance-sensitive algorithms. And even if you could, the circuit complexity of decoding variable-length instructions is superlinear. In x86, every byte can be an instruction boundary, but most aren't, and your decoder needs to be able to handle that. [1] I have in my head somewhere that "the average x86_64 instruction is 3.75 bytes long", but that may be off by a bit. Somewhere around that range, anyway. | |
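For scale, taking that ~3.75-byte average at face value (the footnote's rough recollection, not a measured number):

    100 instructions x ~3.75 bytes ≈ 375 bytes ≈ 3000 bits
    375 bytes / 64-byte cache line ≈ 5.9, i.e. about six lines per fetch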
| ▲ | GeekyBear 3 days ago | parent [-] | | Wasn't the point of SMT that a single instruction decoder had difficulty keeping the core's existing execution units busy? | | |
| ▲ | ajross 3 days ago | parent | next [-] | | No, it's about instruction latency. Some instructions (cache misses that need to hit DRAM) will stall the pipeline and prevent execution of following instructions that depend on the result. So the idea is to keep two streams going at all times so that the other side can continue to fill the units. SMT can be (and was, on some Atom variants) a win even with an in-order architecture with only one pipeline. | |
| ▲ | imtringued 3 days ago | parent | prev | next [-] | | That's a gross misrepresentation of what SMT is, to the point where nothing you said is correct. First of all, in SMT there is only one instruction decoder. SMT merely adds a second set of registers, which is why it is considered a "free lunch". The cost is small in comparison to the theoretical benefit (up to 2x performance). Secondly, the effectiveness of SMT is workload-dependent, which is a property of the software and not the hardware. If you have a properly optimized workload that makes use of the execution units, e.g. a video game or simulation, the benefit is not that big or even negative, because you are already keeping the execution units busy and two threads end up sharing limited resources. Meanwhile, if you have a web server written in Python, then SMT basically doubles your performance. So, it is in fact the opposite. For SMT to be effective, the instruction decoder has to be faster than your execution units, because there are a lot of instructions that don't even touch them. |
| ▲ | BobbyTables2 3 days ago | parent | prev | next [-] | | I vaguely thought it was to provide another source of potentially “ready” instructions when the main thread was blocked on I/O to main memory (such as when register renaming can’t proceed because of dependencies). But I could be way off… | |
| ▲ | fulafel 3 days ago | parent | prev [-] | | No, it's about the same bottleneck that also explains the tapering off of single-core performance. We can't extract more parallelism from the single flow of control of programs, because operations (and especially control-flow transfers) depend on the results of previous operations. SMT is about addressing the underutilization of execution resources where your 6-wide superscalar processor gets 2.0 ILP. See e.g. https://my.eng.utah.edu/~cs6810/pres/6810-09.pdf |
|
| |
| ▲ | eigenform 3 days ago | parent | prev [-] | | I think part of the argument is that doing a micro-op cache is not exactly cutting down on your power/area budget. (But then again, do the AMD e-cores have uop caches?) |
|
|
|
| ▲ | jabl 3 days ago | parent | prev | next [-] |
| > x86 also has flags, which add implicit dependencies between instructions and must make the designer's life harder. Fortunately, flags (or even individual flag bits) can be renamed just like other registers, removing that bottleneck. And some architectures that use flag registers, like aarch64, have additional arithmetic instructions which don't update the flag register. Using flag registers brings benefits as well. For example, conditional branch distances can be much larger (roughly ±1 MB in aarch64 vs. ±4 KB in RISC-V), partly because the conditional branch doesn't need to encode two register operands. |
| |
| ▲ | adgjlsfhk1 3 days ago | parent [-] | | How big a difference is that? 1MB is still too small to jump to an arbitrary function, and 4K is big enough to almost always jump within a function. |
|
|
| ▲ | eigenform 3 days ago | parent | prev [-] |
| > [...] imagine that while you are loading 16 or 32 bytes from the instruction cache, you need to predict the address of the next chunk in the same cycle, before you even see what you got from the cache. Yeah, you [ideally] want to predict the existence of taken branches or jumps in a cache line! Otherwise you have cycles where you're inserting bubbles into the pipeline (if you aren't correctly predicting that the next-fetched line is just the next sequential one...) |