varispeed 3 days ago

Does that really say anything about efficiency? Why can't they decode 100 instructions per cycle?

ajross 3 days ago | parent | next [-]

> Why can't they decode 100 instructions per cycle?

Well, obviously because there aren't 100 individual parallel execution units to which those instructions could be issued. And, lower down the stack, because a 3000-bit-wide[1] fetch path would be extremely difficult to manage: an instruction fetch would span six (!) cache lines, causing clear latency and bottleneck problems (or, conversely, it would demand your icache be 6x wider, causing locality/granularity problems, since many leaf functions are smaller than that).
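
For reference, the rough arithmetic behind those figures, taking the ~3.75-byte average from [1] and a standard 64-byte cache line:

    100 insns/cycle x 3.75 bytes/insn = 375 bytes = 3000 bits per fetch
    375 bytes / 64 bytes per line ≈ 5.9, i.e. six cache lines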

But also because real world code just isn't that parallel. Even assuming perfect branch prediction the number of instructions between unpredictable things like function pointer calls or computed jumps is much less than 100 in most performance-sensitive algorithms.

And even if you could, the circuit complexity of decoding variable-length instructions is superlinear in the decode width. In x86, every byte can be an instruction boundary, but most aren't, and your decoder needs to be able to handle that.
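
To make the decode problem concrete, here's a rough software sketch of what a W-byte-wide parallel decoder has to do (illustrative only; insn_len is a hypothetical helper, not a real x86 length decoder):

    /* Illustrative sketch, not a real x86 decoder. Assumes a
       hypothetical insn_len() that returns the length (>= 1) of
       the instruction starting at a given byte. */
    #include <stddef.h>
    #include <stdint.h>

    extern size_t insn_len(const uint8_t *p);  /* hypothetical */

    #define W 32  /* decode window width in bytes */

    size_t decode_window(const uint8_t *buf, size_t starts[W])
    {
        size_t len_at[W];

        /* Phase 1: W speculative length-decoders, one per byte,
           because any byte could start an instruction. */
        for (size_t i = 0; i < W; i++)
            len_at[i] = insn_len(buf + i);

        /* Phase 2: keep only the decodes that land on real
           boundaries. In hardware this selection is a picker
           network whose size and depth grow faster than
           linearly in W. */
        size_t n = 0;
        for (size_t i = 0; i < W; i += len_at[i])
            starts[n++] = i;
        return n;  /* instructions found in the window */
    }

In software, phase 2 is a cheap serial walk; the hardware pain is that the whole selection has to settle within a cycle.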

[1] I have in my head somewhere that "the average x86_64 instruction is 3.75 bytes long", but that may be off by a bit. Somewhere around that range, anyway.

GeekyBear 3 days ago | parent [-]

Wasn't the point of SMT that a single instruction decoder had difficulty keeping the core's existing execution units busy?

ajross 3 days ago | parent | next [-]

No, it's about instruction latency. Some instructions (e.g. loads that miss cache and have to go all the way to DRAM) will stall the pipeline and prevent execution of following instructions that depend on the result. So the idea is to keep two streams going at all times, so that when one stalls, the other can continue to fill the units. SMT can be (and was, on some Atom variants) a win even on an in-order architecture with only one pipeline.

imtringued 3 days ago | parent | prev | next [-]

That's a gross misrepresentation of what SMT is to the point where nothing you said is correct.

First of all, in SMT there is only one instruction decoder. SMT merely adds a second set of architectural registers, which is why it is considered a "free lunch": the cost is small in comparison to the theoretical benefit (up to 2x performance).

Secondly, the effectiveness of SMT is workload-dependent, which is a property of the software, not the hardware.

If you have a properly optimized workload that already makes good use of the execution units, e.g. a video game or a simulation, the benefit is small or even negative, because you are already keeping the execution units busy and the two threads end up sharing limited resources. Meanwhile, if you have a web server written in Python, then SMT basically doubles your performance.

So, it is in fact the opposite. For SMT to be effective, the instruction decoder has to be faster than your execution units, because there are a lot of instructions that don't even touch them.
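
To illustrate the workload dependence, here's a sketch of two kernels with opposite SMT behavior (a hypothetical microbenchmark; the names and constants are mine, and actual numbers vary a lot by machine). Run two copies pinned to sibling logical CPUs, e.g. `taskset -c 0` and `taskset -c 4` on a box where those share a physical core, and compare against pinning to two separate cores:

    /* Sketch only: two kernels with opposite SMT behavior. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N (1u << 24)  /* ~16M elements, far larger than L3 */

    /* Latency-bound: each load depends on the previous one and
       usually misses cache, so the core idles for hundreds of
       cycles per step. An SMT sibling can use those idle
       cycles, so this is where SMT wins big. */
    static size_t pointer_chase(const size_t *next, size_t steps)
    {
        size_t i = 0;
        while (steps--)
            i = next[i];
        return i;
    }

    /* Throughput-bound: independent multiply-adds keep the FP
       units saturated already, so an SMT sibling mostly just
       competes for them -- small or even negative SMT win. */
    static double dense_fma(const double *a, const double *b, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

    int main(int argc, char **argv)
    {
        if (argc > 1 && strcmp(argv[1], "chase") == 0) {
            size_t *next = malloc(N * sizeof *next);
            for (size_t i = 0; i < N; i++)
                next[i] = i;
            /* Sattolo's algorithm: one big random cycle, so the
               chase visits every node and defeats the prefetcher. */
            for (size_t i = N - 1; i > 0; i--) {
                size_t j = (size_t)rand() % i;
                size_t t = next[i]; next[i] = next[j]; next[j] = t;
            }
            printf("%zu\n", pointer_chase(next, N));
            free(next);
        } else {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);
            for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
            printf("%f\n", dense_fma(a, b, N));
            free(a); free(b);
        }
        return 0;
    }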

BobbyTables2 3 days ago | parent | prev | next [-]

I vaguely thought it was to provide another source of potentially “ready” instructions when the main thread was blocked on I/O to main memory (such as when register renaming can’t proceed because of dependencies).

But I could be way off…

fulafel 3 days ago | parent | prev [-]

No, it's about the same bottleneck that also explains the tapering off of single-core performance: we can't extract more parallelism from a program's single flow of control, because operations (and especially control-flow transfers) depend on the results of previous operations.

SMT is about addressing the underutilization of execution resources where your 6-wide superscalar processor gets 2.0 ILP.

See e.g. https://my.eng.utah.edu/~cs6810/pres/6810-09.pdf
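
As a concrete small-scale illustration of that dependency bottleneck (a sketch; it assumes the compiler isn't allowed to reassociate floating-point adds, i.e. no -ffast-math):

    #include <stddef.h>

    /* Every add depends on the previous one: a 6-wide core can
       still only retire roughly one add per chain step (ILP ~ 1). */
    double serial_sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent accumulator chains: the same core can now
       keep several adders busy at once (ILP ~ 4). */
    double unrolled_sum(const double *a, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)  /* tail */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }

The second version only helps because the four chains are independent; most real control-flow-heavy code can't be restructured that way, which is exactly why the ILP tops out.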

eigenform 3 days ago | parent | prev [-]

I think part of the argument is that adding a micro-op cache is not exactly cutting down on your power/area budget.

(But then again, do the AMD e-cores have uop caches?)