pornel a day ago
There are alternative universes where these wouldn't be a problem. For example, if we didn't settle on executing compiled machine code exactly as-is, and had an instruction-updating pass (less involved than a full VM bytecode compilation), then we could adjust SIMD width for existing binaries instead of waiting decades for a new baseline or multiversioning faff.

Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.
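(To spell out what the "multiversioning faff" looks like in today's universe, a minimal sketch with GCC/Clang function multiversioning; the function and the list of ISA levels are just illustrative. The compiler emits one clone per listed target and a resolver picks one at load time, which is the closest we currently get to "adjusting SIMD width for existing binaries":)

    #include <stddef.h>

    // One clone per ISA level; the dynamic linker resolves the right one at
    // load time, so the binary carries every version it might ever need.
    __attribute__((target_clones("default", "avx2", "avx512f")))
    void scale(float *x, size_t n, float k) {
        for (size_t i = 0; i < n; i++)
            x[i] *= k;   // auto-vectorized separately for SSE2, AVX2, AVX-512
    }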
aengelke a day ago | parent
> if we didn't settle on executing compiled machine code exactly as-is, and had an instruction-updating pass (less involved than a full VM bytecode compilation)

Apple tried something like this: they collected the LLVM bitcode of apps so that they could recompile and even port to a different architecture. To my knowledge, this was done exactly once (watchOS armv7 -> AArch64) and deprecated afterwards. Retargeting at this level is inherently difficult (different ABIs, target-specific instructions, intrinsics, etc.).

For the same target with a larger feature set, the problems are smaller, but so are the gains -- better SIMD usage would only come from the auto-vectorizer and a better instruction selector that uses different instructions. The expected gains, however, are low for typical applications; for math-heavy programs, using optimized libraries or simply recompiling is easier. WebAssembly is a higher-level, more portable bytecode, but its performance is quite a bit behind natively compiled code.
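(As a concrete illustration of the "simply recompiling" point, a sketch; the loop and flags are just examples, assuming a recent GCC or Clang on x86-64. The auto-vectorizer picks the SIMD width from the target baseline, so the win comes from rebuilding, not from rewriting the shipped binary:)

    // saxpy.cc -- same source, different baselines:
    //   g++ -O3 -march=x86-64-v2 saxpy.cc   # SSE4.2, 128-bit vectors
    //   g++ -O3 -march=x86-64-v4 saxpy.cc   # AVX-512, 512-bit vectors
    void saxpy(float *__restrict__ y, const float *__restrict__ x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }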
LegionMammal978 a day ago | parent
> Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.

Is that not where we're already going with the GPGPU trend? The big catch with GPU programming is that many useful routines are irreducibly very branchy (or at least branchy enough that removing the branches slows them down unacceptably), and every divergent branch throws out a huge chunk of the GPU's performance. So you retain a traditional CPU to run all your branchy code, but then you run into memory-bandwidth woes between the CPU and GPU.

It's the exception rather than the rule to have a big block of data elements upfront that can all be handled uniformly with no branching. Those cases usually have to do with graphics, physical simulation, etc., which is why the SIMT model was popularized by GPUs.
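(To make the divergence cost concrete, a rough CUDA sketch; the kernel is made up purely for illustration. The lanes of a warp execute in lockstep, so when a branch splits the warp, the two sides run one after the other and each side's lanes sit idle during the other's turn:)

    __global__ void divergent(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] % 2 == 0)       // data-dependent split inside every warp
            out[i] = in[i] * 3;   // half the lanes execute this...
        else
            out[i] = in[i] - 7;   // ...then the other half execute this
    }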
almostgotcaught 12 hours ago | parent
> There are alternative universes where these wouldn't be a problem

Do people who say these things have literally any experience of merit?

> For example, if we didn't settle on executing compiled machine code exactly as-is, and had an instruction-updating pass

You do understand that at the end of the day, hardware is hard (fixed) and software is soft (malleable), right? There will always be friction at some boundary - it doesn't matter where you hide the rigidity of a literal rock, you eventually reach a point where you cannot reconfigure something that you would like to. And the parts of that rock that are useful are extremely expensive (so no one is adding instruction-updating-pass silicon just because it would be convenient). That's just physics - the rock is very small but fully baked.

> we could have had every instruction SIMDified

Tell me you don't program GPUs without telling me. Not only is SIMT a literal lie today (cf. warp-level primitives), there is absolutely no reason to SIMDify all instructions (and you had better be a wise user of your scalar registers and scalar instructions if you want fast GPU code). I wish people would just realize there's no grand paradigm shift coming that will save them from the difficult work of actually learning how the device works in order to use it efficiently.
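(For anyone who hasn't met "warp-level primitives": the lanes of a warp cooperate through explicit instructions, e.g. the standard shuffle-based reduction below -- a rough sketch assuming CUDA 9+ and a full 32-lane warp -- which only makes sense once you stop pretending each lane is an independent thread.)

    __device__ float warp_sum(float v) {
        // Each step pulls the value from the lane `offset` positions over and
        // adds it; after log2(32) = 5 steps, lane 0 holds the warp-wide sum.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;
    }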