> There are alternative universes where these wouldn't be a problem

Do people that say these things have literally any experience of merit?

> For example, if we didn't settle on executing compiled machine code exactly as-is, and had a instruction-updating pass

You do understand that at the end of the day, hardware is hard (fixed) and software is soft (malleable) right? There will be always be friction at some boundary - it doesn't matter where you hide the rigidity of a literal rock, you eventually reach a point where you cannot reconfigure something that you would like to. And also the parts of that rock that are useful are extremely expensive (so no one is adding instruction-updating pass silicon just because it would be convenient). That's just physics - the rock is very small but fully baked.

> we could have had every instruction SIMDified

Tell me you don't program GPUs without telling me. Not only is SIMT a literal lie today (cf warp level primitives), there is absolutely no reason to SIMDify all instructions (and you better be a wise user of your scalar registers and scalar instructions if you want fast GPU code).

I wish people would just realize there's no grand paradigm shift that's coming that will save them from the difficult work of actually learning how the device works in order to be able to use it efficiently.

▲

pornel 3 months ago | parent [-]

The point of updating the instructions isn't to have optimal behavior in all cases, or to reconfigure programs for wildly different hardware, but to be able to easily target contemporary hardware, without having to wait for the oldest hardware to die out first to be able to target a less outdated baseline without conditional dispatch.

Users are much more forgiving about software that runs a bit slower than software that doesn't run at all. ~95% of x86_64 CPUs have AVX2 support, but compiling binaries to unconditionally rely on it makes the remaining users complain. If it was merely slower on potato hardware, it'd be an easier tradeoff to make.

This is the norm on GPUs thanks to shader recompilation (they're far from optimal for all hardware, but at least get to use the instruction set of the HW they're running on, instead of being limited to the lowest common denominator). On CPUs it's happening in limited cases: Zen 3 added AVX-512 by executing two 256-bit operations serially, and plenty of less critical instructions are emulated in microcode, but it's done by the hardware, because our software isn't set up for that.

Compilers already need to make assumptions about pipeline widths and instruction latencies, so the code is tuned for specific CPU vendors/generations anyway, and that doesn't get updated. Less explicitly, optimized code also makes assumptions about cache sizes and compute vs memory trade-offs. Code may need L1 cache of certain size to work best, but it still runs on CPUs with a too-small L1 cache, just slower. Imagine how annoying it would be if your code couldn't take advantage of a larger L1 cache without crashing on older CPUs. That's where CPUs are with SIMD.

▲

almostgotcaught 3 months ago | parent [-]

i have no idea what you're saying - i'm well aware that compilers do lots of things but this sentence in your original comment

> compiled machine code exactly as-is, and had a instruction-updating pass

implies there should be silicon that implements the instruction-updating - what else would be "executing" compiled machine code other than the machine itself...........

▲

pornel 3 months ago | parent [-]

I was talking about a software pass. Currently, the machine code stored in executables (such as ELF or PE) is only slightly patched by the dynamic linker, and then expected to be directly executable by the CPU. The code in the file has to be already compatible with the target CPU, otherwise you hit illegal instructions. This is a simplistic approach, dating back to when running executables was just a matter of loading them into RAM and jumping to their start (old a.out or DOS COM).

What I'm suggesting is adding a translation/fixup step after loading a binary, before the code is executed, to make it more tolerant to hardware changes. It doesn’t have to be full abstract portable bytecode compilation, and not even as involved as PTX to SASS, but more like a peephole optimizer for the same OS on the same general CPU architecture. For example, on a pre-AVX2 x86_64 CPU, the OS could scan for AVX2 instructions and patch them to do equivalent work using SSE or scalar instructions. There are implementation and compatibility issues that make it tricky, but fundamentally it should be possible. Wilder things like x86_64 to aarch64 translation have been done, so let's do it for x86_64-v4 to x86_64-v1 too.

	▲	almostgotcaught 3 months ago \| parent [-]
		that's certainly more reasonable so i'm sorry for being so flippant. but even this idea i wager the juice is not worth the squeeze outside of stuff like Rosetta as you alluded, where the value was extremely high (retaining x86 customers).