▲ | LegionMammal978 5 months ago |
> Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.

Isn't that where we're already going with the GPGPU trend? The big catch with GPU programming is that many useful routines are irreducibly branchy (or at least branchy enough that removing the branches slows them down unacceptably), and every divergent branch throws away a huge chunk of the GPU's performance. So you keep a traditional CPU around to run all your branchy code, but then you run into memory-bandwidth woes moving data between the CPU and GPU.

Having a big block of data elements up front that can all be handled uniformly, with no branching, is the exception rather than the rule. Those cases usually involve graphics, physical simulation, etc., which is why the SIMT model was popularized by GPUs.
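A minimal CUDA sketch (mine, not from the comment; the kernel and helper names are made up) of the divergence problem described above: when lanes of one 32-thread warp disagree on a data-dependent branch, the hardware runs both sides one after the other with the inactive lanes masked off, so the warp pays roughly for path A plus path B.

    __device__ float path_a(float x) { return x * x + 1.0f; }  // stand-in for heavy work
    __device__ float path_b(float x) { return -x * 0.5f; }     // stand-in for different heavy work

    __global__ void divergent_kernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Data-dependent branch: lanes of the same warp may take different
        // sides here, so the warp executes path_a and then path_b serially,
        // with the lanes that didn't take each side masked off.
        if (in[i] >= 0.0f) {
            out[i] = path_a(in[i]);
        } else {
            out[i] = path_b(in[i]);
        }
    }

On uniformly signed input the branch never diverges at all, which is the "big block of data handled uniformly" case the comment calls the exception rather than the rule.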
▲ | winwang 5 months ago | parent | next |
Fun fact that I'm 50%(?) sure of: a single branch divergence in integer instructions on current Nvidia GPUs won't hurt perf, because there are only 16 int32 lanes per warp scheduler anyway.
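If I'm reading that right (this is my gloss, not winwang's), the intuition is: with 16 int32 lanes per scheduler, a full 32-thread warp of int32 work already takes two issue slots, so a branch that splits the warp into two 16-thread halves still fills those 16 lanes on each side and the divergence costs little extra. A hypothetical kernel hitting exactly that case:

    __global__ void half_warp_split(const int* in, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int lane = threadIdx.x & 31;  // lane index within the 32-thread warp
        // Splits every warp into lanes 0-15 and 16-31. Each half is 16 threads
        // wide, so (assuming the 16-lane figure is right) the two serialized
        // halves take about the same number of int32 issue slots as one
        // undiverged warp would.
        if (lane < 16) {
            out[i] = in[i] * 3 + 1;
        } else {
            out[i] = in[i] * 5 - 7;
        }
    }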
▲ | pornel 5 months ago | parent | prev |
CPUs are not good at branchy code either. Branch mispredictions cause costly pipeline stalls, so you have to make branches either predictable or replace them with conditional moves. Trivially predictable branches are fast, but so are non-diverging warps on GPUs, and conditional moves and masked SIMD work pretty much exactly like predication on a GPU.

Even if you have a branchy divide-and-conquer problem that seems ideal for freely diverging threads, you'll get hit by the relatively high overhead of distributing work across threads, false sharing, and stalls from cache misses.

My hot take is that GPUs will keep gaining features that make them better at traditionally CPU-side problems (e.g. AMD's Shader Call proposal, which helps with processing unbalanced tree-structured data), while CPUs get downgraded to being just a coprocessor for bootstrapping the GPU drivers.
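A small sketch (my own wording of the "predictable branches or conditional moves" point, with made-up function names; plain host-side code): on unpredictable data the branchy loop pays for mispredictions, while the select form is the shape compilers can lower to conditional moves or masked/blended SIMD on a CPU, which is essentially predication as a GPU does it.

    // Branchy: a data-dependent conditional jump. If the sign of v[i] is
    // effectively random, the branch predictor misses about half the time.
    float sum_positive_branchy(const float* v, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            if (v[i] > 0.0f)
                s += v[i];
        }
        return s;
    }

    // Branchless: always add, but select the operand. Compilers can lower the
    // select to a conditional move or a masked/blended SIMD add; on a GPU the
    // same pattern runs as per-lane predication with no divergence.
    float sum_positive_branchless(const float* v, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            s += (v[i] > 0.0f) ? v[i] : 0.0f;  // select, not a jump
        }
        return s;
    }

Whether the branchless version actually wins depends on how predictable the data is, which is the point above about trivially predictable branches being fast.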