| ▲ | hinkley 6 days ago |
| How many times has hyperthreading been an actual performance benefit in processors? I cannot count how many times an article has come out saying you'll get better performance out of your <insert processor here> by turning off hyperthreading in the BIOS. It's gotta be at least 2 out of every 3 chip generations going back to the original implementation, where you're better off without it than with. |
|
| ▲ | loeg 6 days ago | parent | next [-] |
| HT provides a significant benefit to many workloads. The use cases that benefit from actually disabling HT are likely working around pessimal OS scheduler or application thread use. (After all, even with it enabled, you're free to not use the sibling cores.) Otherwise, it is an overgeneralization to say that disabling it will benefit arbitrary workloads. |
| |
| ▲ | hedora 6 days ago | parent | next [-] | | There’s some argument that you should jam stuff on to as few hyperthread pairs as possible to improve energy efficiency and cache locality. Of course, if the CPU governor is set to “performance” or “game mode”, then the OS should use as many pairs as possible instead (unless thermal throttling matters; computers are hard). | |
| ▲ | mkbosmans 6 days ago | parent | prev | next [-] | | Especially in HPC there are lots of workloads that do not benefit from SMT. Such workloads are almost always bottlenecked on either memory bandwidth or vector execution ports. These are exactly the resources that are shared between the sibling threads. So now you have a choice of either disabling SMT in the bios, or make sure the application correctly interprets the CPU topology and only spawns one thread per physical core. The former is often the easier option, both from software development and system administration perspective. | | |
| ▲ | PunchyHamster 6 days ago | parent | next [-] | | HT cores can still run OS stuff in that case as that isn't really in contention with those. Tho I can see someone not wanting to bother with pinning | |
| ▲ | skeezyboy 6 days ago | parent | prev [-] | | >Especially in HPC there are lots of workloads that do not benefit from SMT...So now you have a choice of either disabling SMT in the bios That's madness. They're cheaper than their all-core equivalent. Why even buy one in the first place if HT slows down the CPU? You're still better off with them enabled. |
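For reference, the second option mkbosmans describes (reading the CPU topology and using only one logical CPU per physical core) could look roughly like the sketch below on Linux. It assumes the kernel exposes `thread_siblings_list` under `/sys/devices/system/cpu/cpuN/topology/`; the helper names `parse_cpu_list`, `one_cpu_per_core`, and `read_sibling_lists` are made up for illustration and are not from any library.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Expand a Linux cpulist string such as "0,4" or "0-1,8-9" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(siblings_by_cpu):
    """Keep the lowest-numbered SMT sibling of every physical core.

    siblings_by_cpu maps a logical CPU number to the cpulist string of
    all logical CPUs sharing its core (itself included).
    """
    chosen = set()
    for cpulist in siblings_by_cpu.values():
        siblings = parse_cpu_list(cpulist)
        if siblings:
            chosen.add(siblings[0])
    return sorted(chosen)


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every online CPU (Linux sysfs, assumed layout)."""
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    for path in glob.glob(pattern):
        match = re.search(r"cpu(\d+)/topology", path)
        if match is None:
            continue
        with open(path) as handle:
            result[int(match.group(1))] = handle.read()
    return result
```

On Linux, passing `one_cpu_per_core(read_sibling_lists())` to `os.sched_setaffinity(0, ...)` would pin the current process to one hyperthread per core, and the length of that list is a reasonable worker-thread count. Real HPC codes usually get this from hwloc or the MPI launcher instead.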
| |
| ▲ | robocat 6 days ago | parent | prev [-] | | > use cases that benefit from actually disabling HT Other benefits: per-CPU software licencing sometimes, and security on servers that share CPU with multiple clients. |
|
|
| ▲ | twoodfin 6 days ago | parent | prev | next [-] |
| For whatever it’s worth, operational database systems (many users/connections, unpredictable access patterns) are beneficiaries of modern hyperthreading. I’m familiar with one such system where the throughput benefit is ~15%, which is a big deal for a BIOS flag. IBM’s POWER would have been discontinued a decade ago were it not for transactional database systems, and that architecture is heavily invested in SMT, up to 8-way(!) |
| |
| ▲ | jiggawatts 6 days ago | parent | next [-] | | I've noticed an overreliance on throughput as measured during 100% load as the performance metric, which has resulted in hardware vendors "optimising to the test" at the expense of other, arguably more important metrics. For example: single-user latency when the server is just 50% loaded. | | |
| ▲ | twoodfin 6 days ago | parent | next [-] | | That’s more than fair. In the system I’m most familiar with, however, the benefits of hyperthreading for throughput extend to the 50-70% utilization band where p99 latency is not stressed. | |
| ▲ | hinkley 6 days ago | parent | prev [-] | | Or p98 time for requests. Throughput and latency are usually at odds with each other. |
| |
| ▲ | tom_ 6 days ago | parent | prev [-] | | Why do they need so many threads? This really feels like they just designed the cpu poorly, in that it can't extract enough parallelism out of the instruction stream already. (Intel and AMD stopped at 2! Apparently more wasn't worth it for them. Presumably because the cpu was doing enough of the right thing already.) | | |
| ▲ | ckozlowski 6 days ago | parent | next [-] | | As I recall it, Intel brought about Hyperthreading on Northwood and later Pentium 4s as a way to help with issues in its long pipeline. As I remember it described at the time, P4 had 30+ stages in its pipeline. Many of them did not need to be used in a given thread. Furthermore, if a branch prediction engine guessed wrong, then the pipeline needed to be cleared and started anew. For a 30+ stage pipeline, that's a lot of wasted clock cycles. So hyper-threading was a way to recoup some of those losses. I recall reading at the time that it was a "latency hiding technique". How effective it was I leave to others. But it became standard it seems on all x86 processors in time. Core and Core 2 didn't seem to need it (much shorter pipelines) but later Intel and AMD processors got it. This is how it was explained to me at the time anyways. I was working at an OEM from '02-'05, and I recall when this feature came out. I pulled out my copy of "Inside the Machine" by Jon Stokes which goes deep into the P4 architecture, but strangely I can only find a single mention of hyperthreading in the book. But it goes far into the P4 architecture and why branch misses are so punishing. It's a good read. Edit: Adding that I suspect instruction pipelines are not so long that adding additional threads would help. I suspect diminishing returns past 2. | | |
| ▲ | justsomehnguy 6 days ago | parent | next [-] | | > As I recall it, Intel brought about Hyperthreading on Northwood and later Pentium 4s as a way to help with issues in it's long pipeline. Well, Intel brought Hyperthreading to Xeon first and they were quite slow, so the additional thread performance was quite welcome there. But the GHz race led to the monstrosity of 3.06GHz CPUs where the improvement in speed didn't quite translate to an improvement in performance. And while Northwood fared well (especially considering the disaster of Willamette) GHz/performance wise, Prescott didn't and mostly showed the same performance in non-SSE/cache bound tasks[1], so Intel needed to push the GHz further which required a longer pipeline and brought even more penalty on a prediction miss. Well, at least this is how I remember it. [0] https://en.wikipedia.org/wiki/List_of_Intel_Xeon_processors_... [1] but excelled in room heating, people joked that they didn't even bother with apartment heating in winter, just leaving a computer running | |
| ▲ | bee_rider 6 days ago | parent | prev [-] | | Any time somebody mentions the Pentium 4, it feels like a peek at a time-line we didn’t end up going down. Imagine if Intel had stuck to their guns, maybe they could have pushed through and we’d have CPUs with ridiculous 90 stage pipelines, and like 4 threads per core. Maybe frameworks, languages, and programmer experience would have conspired to help write programs with threads that work together very closely, taking advantage of the shared cache of the hyperthreads. I mean, it obviously didn’t happen, but it is fun to wonder about. |
| |
| ▲ | TristanBall 6 days ago | parent | prev | next [-] | | I suspect part of it is licensing games, both in the sense of "avoiding per core license limits" which absolutely matters when your DB is costing a million bucks, and also in the 'enable the highest PVU score per chassis' for IBM's own license farming. Power systems tend not to be under the same budget constraints as Intel, whether that's money, power, heat, whatever, so the cost-benefit of adding more sub-core processing for incremental gains is likely different too. I may have a raft of issues with IBM, and AIX, but those Power chips are top notch. | | |
| ▲ | hinkley 6 days ago | parent [-] | | Yeah that was another thing. You run Oracle you gotta turn that shit off in the BIOS otherwise you're getting charged 2x for 20% more performance. | | |
| ▲ | wmf 6 days ago | parent [-] | | AFAIK Oracle does not charge extra for SMT. |
|
| |
| ▲ | twoodfin 6 days ago | parent | prev | next [-] | | Low-latency databases are architected to be memory-bandwidth bound. SMT allows more connections to be generating more loads faster, utilizing more memory bandwidth. Think async or green threads, but for memory or branch misses rather than blocking I/O. (As mentioned elsewhere, optimizing for vendor licensing practices is a nice side benefit, but obviously if the vendors want $X for Y compute on their database, they’ll charge that somehow.) | |
| ▲ | wmf 6 days ago | parent | prev [-] | | Power does have higher memory latency because of OMI and it supports more sockets. But I think the main motivation for SMT8 is low-IPC spaghetti code. |
|
|
|
| ▲ | BrendanLong 6 days ago | parent | prev | next [-] |
| To be fair, in most of these tests hyperthreading did provide a significant benefit (in the general CPU stress test, the hyperthreads increased performance by ~66%). It's just confusing that utilization metrics treat hyperthread usage the same as full physical cores. |
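To put rough numbers on that confusion, here is a back-of-the-envelope model. It rests on assumptions of mine, not on anything measured in the article: the scheduler fills one hyperthread per core before doubling up, a core with both hyperthreads busy does 1.66x the work of a core with one busy (taken from the ~66% figure above), and "utilization" is just busy logical CPUs divided by total logical CPUs. `fraction_of_peak` is a made-up name.

```python
def fraction_of_peak(reported_utilization, smt_speedup=1.66):
    """Estimate throughput as a fraction of peak for a given reported utilization.

    Assumed model: two hyperthreads per core, one thread per core is
    filled first, and a fully loaded core delivers smt_speedup times the
    throughput of a core running a single thread.
    """
    u = reported_utilization
    if not 0.0 <= u <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if u <= 0.5:
        # Only first hyperthreads are busy: 2*u of the cores run one thread.
        per_core = 2.0 * u
    else:
        # Every core runs one thread; (2*u - 1) of them also run a second.
        per_core = 1.0 + (2.0 * u - 1.0) * (smt_speedup - 1.0)
    return per_core / smt_speedup
```

Under those assumptions a machine that reports 50% utilization is already at about 60% of its peak throughput (1 / 1.66), so the remaining "50%" of headroom is really closer to 40%.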
|
| ▲ | bee_rider 6 days ago | parent | prev | next [-] |
| Those weird Xeon Phi accelerators had 4 threads per core, and IIRC needed at least 2 running to get full performance. They were sort of niche, though. I guess in general parallelism inside a core will either be extracted by the computer automatically with instruction-level-parallelism, or the programmer can tell it about independent tasks, using hyperthreads. So the hyperthread implementations are optimistic about how much programmers care about performance, haha. |
| |
| ▲ | mkbosmans 6 days ago | parent [-] | | Sort of niche indeed. In addition to needing SMT to get full performance, there were a lot of other small details you needed to get right on Xeon Phi to get close to the advertised performance. Think of AVX512 and the HBM. For practical applications, it never really delivered. |
|
|
| ▲ | tgma 6 days ago | parent | prev | next [-] |
| It has a lot to do with your workload as well as, if not more so than, the chip architecture. The primary trade-off is the cache utilization when executing two sets of instruction streams. |
| |
| ▲ | hinkley 6 days ago | parent [-] | | That's likely the primary factor, but then there's thermal throttling as well. You can't run all of the logic units flat out on a bunch of models of CPU. | | |
| ▲ | tgma 6 days ago | parent | next [-] | | May be true for FMA or AVX2 or similar stuff. Outside vector units that sounds implausible. Obviously multi core thermal throttling is a thing but that would by far dominate. Hyperthreading should have minimal impact there. | |
| ▲ | gruez 6 days ago | parent | prev [-] | | >but then there's thermal throttling as well. You can't run all of the logic units flat out on a bunch of models of CPU. That doesn't make any sense. Disabling SMT likely saves a negligible amount of power, but forfeits any performance to be gained from the other thread. If there's thermal budget available, it's better to spend it by shoving more work onto the second thread than to leave it disabled. If anything, due to voltage/frequency curves, it might even be better to run your CPU at lower clocks but with SMT enabled to make up for it (assuming it's amenable to your workloads), than it is to run with SMT disabled. |
|
|
|
| ▲ | duped 6 days ago | parent | prev | next [-] |
| For me today it's definitely a pessimation because I have enough well-meaning applications that spawn `nproc` worker threads. Which would be fine if they're the only process running, but they're not. |
| |
| ▲ | hinkley 6 days ago | parent | next [-] | | I wrote a little tool for our services that could evaluate a basic expression of nproc, taken from an environment variable, at startup time. You could do one thread for every two cores, three threads for every 2 cores, one thread per core ± 1, or both (2n + 1). Unfortunately the sweet spot based on our memory usage always came out to 1:1, except for a while when we had a memory leak that was surprisingly hard to fix, and we ran n - 1 for about 4 months while a bunch of work and exploratory testing were done. We had to tune in other places to maximize throughput. | |
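A tool like the one hinkley describes might be sketched as below. This is a guess at the idea, not their implementation: the expression grammar (`n`, `n/2`, `3n/2`, `n-1`, `2n+1`), the function name `worker_count`, and the `WORKER_THREADS` environment variable in the usage note are all invented for illustration.

```python
import re

# Accepts forms like "n", "n/2", "3n/2", "n-1", "n+1", "2n+1" (assumed grammar).
_EXPRESSION = re.compile(
    r"^\s*(\d*)\s*n\s*(?:/\s*(\d+))?\s*(?:([+-])\s*(\d+))?\s*$"
)


def worker_count(expression, n):
    """Evaluate a tiny 'a*n/b +/- c' expression against the core count n."""
    match = _EXPRESSION.match(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    multiplier = int(match.group(1) or 1)
    divisor = int(match.group(2) or 1)
    if divisor == 0:
        raise ValueError("divisor must be non-zero")
    count = (multiplier * n) // divisor
    if match.group(3):
        offset = int(match.group(4))
        count += offset if match.group(3) == "+" else -offset
    # Never return fewer than one worker, even for "n-1" on a 1-core box.
    return max(1, count)
```

At startup a service could call something like `worker_count(os.environ.get("WORKER_THREADS", "n"), os.cpu_count() or 1)`. Note that `os.cpu_count()` counts logical CPUs (hyperthreads included) and ignores affinity masks, so on Linux `len(os.sched_getaffinity(0))` is closer to what `nproc` reports.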
| ▲ | toast0 6 days ago | parent | prev [-] | | Wouldn't that be about the same badness without hyperthreads? If you're oversubscribed, there might be some benefit to having fewer tasks, but maybe you get some good throughput with two different application's threads running on opposite hyperthreads. | | |
| ▲ | hinkley 6 days ago | parent [-] | | Oversubscribing also leads to process migration, which these days leads to memory read delays. |
|
|
|
| ▲ | esseph 6 days ago | parent | prev | next [-] |
| Intel vs AMD, you'll get a different answer on the hyperthreading question. https://www.tomshardware.com/pc-components/cpus/zen-4-smt-fo... |
|
| ▲ | toast0 6 days ago | parent | prev | next [-] |
| Going from 1 core to 2 hyperthreads was a big bonus in interactivity. But I think it was easy to get early systems to show worse throughput. I think there are two kinds of loads where hyperthreads are more likely to hurt than help. If you've got a tight loop that uses all the processor execution resources, you're not gaining anything by splitting that in two, it just makes things harder. Or if your load is mostly bound by memory bandwidth without a lot of compute... having more threads probably means you're that much more oversubscribed on i/o and caching. But a lot of loads are grab some stuff from memory and then do some compute, rinse and repeat. There's a lot of potential for idle time while waiting on a load, being able to run something else during that time makes a lot of sense. It's worth checking how your load performs with hyperthreads off, but I think default on is probably the right choice. |
| |
| ▲ | sroussey 6 days ago | parent [-] | | Definitely measure both ways and decide. For many years (still?) it was faster to run your database with hyper threading turned off and your app server with it turned on. |
|
|
| ▲ | FpUser 6 days ago | parent | prev | next [-] |
| In the old days it had made the difference between my multimedia game like application not working at all with hyperthreading off to working just fine with it on. |
| |
| ▲ | hinkley 6 days ago | parent [-] | | Yeah when it was one core versus 1.3 cores that's fair. But 3 core machines often did better (or at least more consistently run to run) with HT disabled. |
|
|
| ▲ | tom_ 6 days ago | parent | prev [-] |
| Total throughput has always seemed better with it switched on for me, even for stuff that isn't hyper threading friendly. You get a free 10% at least. |