PaulKeeble 6 days ago

This is bang on. You can't count the hyperthreads as double the performance: in practice they typically bring only 15-30% extra, and only if the job works well with them, while doubling per-thread latency. Failing to account for the loss in clock speed as core utilisation climbs is another way it's not linear, and in modern desktop software it's really something to pay careful attention to.

From the information the OS exposes about a CPU, it should be possible to better estimate utilisation by accounting for at least these two factors. It gets trickier once you significantly exceed the cache or the available memory bandwidth, since the increased pipeline stalls can also slow down the threads that are already running. But it can definitely be done better than it is currently.
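A rough sketch of what such an estimator could look like, assuming a 25% throughput gain per sibling hyperthread and a 15% clock droop from single-core turbo to all-core load (both numbers are illustrative placeholders, not measured values for any real CPU):

```python
def effective_capacity(busy_logical, physical_cores,
                       smt_gain=0.25, turbo_drop=0.15):
    """Estimate effective throughput in 'full core' units.

    busy_logical   -- number of busy logical CPUs (hardware threads)
    physical_cores -- number of physical cores
    smt_gain       -- assumed extra throughput per sibling hyperthread
    turbo_drop     -- assumed total clock-speed loss going from one
                      active core to all cores active
    """
    # Threads beyond the physical core count are SMT siblings and
    # only add a fraction of a core's worth of throughput.
    primary = min(busy_logical, physical_cores)
    sibling = max(0, busy_logical - physical_cores)
    cores_worth = primary + sibling * smt_gain

    # Approximate the clock droop as linear in the number of
    # active physical cores.
    active = min(busy_logical, physical_cores)
    freq_scale = 1.0 - turbo_drop * (active - 1) / max(1, physical_cores - 1)

    return cores_worth * freq_scale
```

With these assumptions, a 4C/8T part with all 8 logical CPUs busy comes out at about 4.25 "cores' worth" of throughput rather than 8, which is the kind of gap the comment above is pointing at.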

c2h5oh 6 days ago | parent | next [-]

To complicate things further, HT performance varies wildly between CPU architectures and workloads. e.g. AMD's implementation, especially in later Zen cores, gets closer to the performance of a full core than you'd see on Intel CPUs, provided you aren't starved for memory bandwidth.

RaftPeople 6 days ago | parent | next [-]

> To complicate things more HT performance varies wildly between CPU architectures and workloads.

IBM's POWER CPUs have also traditionally done a great job with SMT compared to Intel's implementation.

shim__ 6 days ago | parent | prev [-]

What's the difference between Intel's and AMD's approach?

richardwhiuk 6 days ago | parent [-]

Basically it comes down to how much of each core's resources are shared between the sibling threads versus dedicated to each.

magicalhippo 6 days ago | parent | prev | next [-]

For memory-bound applications the scaling can be much better. A renderer I worked on was primarily memory-bound walking the acceleration structure, and saw a 60-70% increase from hyperthreads.

But overall yeah.

Sohcahtoa82 6 days ago | parent | prev [-]

Back when I got an i7-3770K (4C/8T), I did a very basic benchmark using POV-Ray.

Going from 1 thread to 2 threads doubled the speed as expected. Going from 2 to 4 doubled it again. Going from 4 to 8 was only ~15% faster.

I imagine you could probably create a contrived benchmark that actually gives you nearly double the performance from SMT, but I don't know what it would look like. Maybe one written to deliberately miss cache on every access?
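The kind of measurement described above can be scripted. A minimal harness sketch, timing the same fixed batch of CPU-bound jobs at different worker counts and reporting the speedup relative to one worker (the `spin` busy-loop is a stand-in for real work like a render, and the worker counts assume a 4C/8T machine):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def spin(n):
    # CPU-bound busy work standing in for a render job.
    acc = 0
    for i in range(n):
        acc = (acc * 31 + i) & 0xFFFFFFFF
    return acc

def measure_scaling(worker_counts=(1, 2, 4, 8), jobs=8, work=2_000_000):
    """Time a fixed batch of jobs at each worker count.

    Returns {workers: speedup vs. 1 worker}.
    """
    baseline = None
    results = {}
    for workers in worker_counts:
        start = time.perf_counter()
        with ProcessPoolExecutor(max_workers=workers) as pool:
            list(pool.map(spin, [work] * jobs))
        elapsed = time.perf_counter() - start
        if baseline is None:
            baseline = elapsed
        results[workers] = baseline / elapsed
    return results
```

On a 4C/8T part you'd expect the speedup to climb nearly linearly up to 4 workers and then flatten out from 4 to 8, which is exactly the ~15% step the benchmark above observed. (On platforms that spawn rather than fork, run this under an `if __name__ == "__main__":` guard.)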

Side note, I should run that POV-Ray test again. It's been years since I've even used POV-Ray.