▲ | almostgotcaught 12 hours ago | |||||||
> There are alternative universes where these wouldn't be a problem Do people that say these things have literally any experience of merit? > For example, if we didn't settle on executing compiled machine code exactly as-is, and had a instruction-updating pass You do understand that at the end of the day, hardware is hard (fixed) and software is soft (malleable) right? There will be always be friction at some boundary - it doesn't matter where you hide the rigidity of a literal rock, you eventually reach a point where you cannot reconfigure something that you would like to. And also the parts of that rock that are useful are extremely expensive (so no one is adding instruction-updating pass silicon just because it would be convenient). That's just physics - the rock is very small but fully baked. > we could have had every instruction SIMDified Tell me you don't program GPUs without telling me. Not only is SIMT a literal lie today (cf warp level primitives), there is absolutely no reason to SIMDify all instructions (and you better be a wise user of your scalar registers and scalar instructions if you want fast GPU code). I wish people would just realize there's no grand paradigm shift that's coming that will save them from the difficult work of actually learning how the device works in order to be able to use it efficiently. | ||||||||
▲ | pornel 4 hours ago | parent [-] | |||||||
The point of updating the instructions isn't to have optimal behavior in all cases, or to reconfigure programs for wildly different hardware, but to be able to easily target contemporary hardware, without having to wait for the oldest hardware to die out first to be able to target a less outdated baseline without conditional dispatch. Users are much more forgiving about software that runs a bit slower than software that doesn't run at all. ~95% of x86_64 CPUs have AVX2 support, but compiling binaries to unconditionally rely on it makes the remaining users complain. If it was merely slower on potato hardware, it'd be an easier tradeoff to make. This is the norm on GPUs thanks to shader recompilation (they're far from optimal for all hardware, but at least get to use the instruction set of the HW they're running on, instead of being limited to the lowest common denominator). On CPUs it's happening in limited cases: Zen 3 added AVX-512 by executing two 256-bit operations serially, and plenty of less critical instructions are emulated in microcode, but it's done by the hardware, because our software isn't set up for that. Compilers already need to make assumptions about pipeline widths and instruction latencies, so the code is tuned for specific CPU vendors/generations anyway, and that doesn't get updated. Less explicitly, optimized code also makes assumptions about cache sizes and compute vs memory trade-offs. Code may need L1 cache of certain size to work best, but it still runs on CPUs with a too-small L1 cache, just slower. Imagine how annoying it would be if your code couldn't take advantage of a larger L1 cache without crashing on older CPUs. That's where CPUs are with SIMD. | ||||||||
|