GeekyBear 3 days ago

Wasn't the point of SMT that a single instruction decoder had difficulty keeping the core's existing execution units busy?

ajross 3 days ago | parent | next [-]

No, it's about instruction latency. Some instructions (e.g. loads that miss the caches and have to go out to DRAM) stall the pipeline and block execution of later instructions that depend on the result. So the idea is to keep two instruction streams in flight at all times, so that while one is stalled the other can keep the execution units fed. SMT can be (and was, on some Atom variants) a win even on an in-order architecture with only one pipeline.
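
To make that concrete, here is a toy simulator (purely illustrative, not modeling any real core): a single-issue, in-order pipeline where every instruction depends on its predecessor, and a thread whose last result is still in flight simply yields the issue slot to the other thread.

```python
# Toy single-issue, in-order pipeline. Each "instruction" is just a
# result latency in cycles, and (worst case) every instruction in a
# stream depends on the previous one. Hypothetical model only.

def run(streams):
    """Cycles to finish all streams, issuing at most one instruction
    per cycle and skipping any thread whose previous result is still
    in flight."""
    nxt = [0] * len(streams)    # index of next instruction per thread
    ready = [0] * len(streams)  # cycle at which each thread may issue
    cycle = 0
    while any(nxt[t] < len(s) for t, s in enumerate(streams)):
        for t, s in enumerate(streams):
            if nxt[t] < len(s) and ready[t] <= cycle:
                ready[t] = cycle + s[nxt[t]]  # result arrives later
                nxt[t] += 1
                break           # only one issue slot per cycle
        cycle += 1
    return max(ready)           # when the last result lands
```

With a dependent chain of eight 4-cycle loads, one thread alone takes 32 cycles, so two runs back to back take 64 — but two SMT threads sharing the pipeline finish in 33, because each fills the other's stall cycles.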

imtringued 3 days ago | parent | prev | next [-]

That's a gross misrepresentation of what SMT is, to the point where nothing you said is correct.

First of all, in SMT there is only one instruction decoder. SMT merely adds a second set of architectural registers, which is why it is considered a "free lunch": the hardware cost is small compared to the theoretical benefit (up to 2x performance).

Secondly, the effectiveness of SMT is workload dependent, which is a property of the software, not the hardware.

If you have a properly optimized workload that saturates the execution units, e.g. a video game or a simulation, the benefit is small or even negative, because you are already keeping the execution units busy and the two threads end up sharing limited resources. Meanwhile, if you have a web server written in Python, SMT basically doubles your performance.

So it is in fact the opposite: for SMT to be effective, the instruction decoder has to be faster than your execution units, because a lot of instructions never even touch them.
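
A back-of-the-envelope way to put numbers on that (my own simplification, not a vendor figure): if one thread keeps the shared execution units busy a fraction `util` of the time, a second SMT thread can at best soak up the idle fraction.

```python
# Upper-bound estimate of SMT throughput gain, ignoring contention
# for caches, decode bandwidth, and other shared resources.
# `util` is the execution-unit utilization of a single thread.

def smt_speedup_bound(util):
    assert 0.0 < util <= 1.0
    return min(2.0, 1.0 / util)  # can't beat two threads' worth of work
```

An optimized game loop at 90% utilization caps out around 1.1x, while interpreter-style code at 30% utilization hits the full 2x — which is the workload dependence described above.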

BobbyTables2 3 days ago | parent | prev | next [-]

I vaguely thought it was to provide another source of potentially “ready” instructions when the main thread was blocked on I/O to main memory (such as when register renaming can’t proceed because of dependencies).

But I could be way off…

fulafel 3 days ago | parent | prev [-]

No, it's about the same bottleneck that also explains the tapering off of single-core performance: we can't extract more parallelism from a program's single flow of control, because operations (and especially control-flow transfers) depend on the results of previous operations.

SMT is about addressing the underutilization of execution resources where your 6-wide superscalar processor gets 2.0 ILP.

See e.g. https://my.eng.utah.edu/~cs6810/pres/6810-09.pdf
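
That underutilization is easy to reproduce with a toy scheduler (illustrative only): greedy issue on a `width`-wide machine with unit-latency instructions, where each instruction can start only once everything it depends on has finished.

```python
from collections import Counter

def ipc(deps, width=6):
    """deps[i] lists the earlier instructions that i depends on.
    Returns achieved instructions-per-cycle on a width-wide machine
    with unit-latency instructions and a perfect front end."""
    finish = []       # cycle in which instruction i completes
    load = Counter()  # instructions issued in each cycle
    for d in deps:
        c = max((finish[j] for j in d), default=0)  # operands ready
        while load[c] >= width:                     # find a free slot
            c += 1
        load[c] += 1
        finish.append(c + 1)
    return len(deps) / max(finish)
```

Twelve independent instructions achieve the full 6.0 IPC, while the same twelve arranged as one dependency chain get 1.0 — on the same 6-wide machine.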