pornel a day ago

There are alternative universes where these wouldn't be a problem.

For example, if we didn't settle on executing compiled machine code exactly as-is, and had an instruction-updating pass (less involved than a full VM bytecode compilation), then we could adjust SIMD width for existing binaries instead of waiting decades for a new baseline or putting up with multiversioning faff.
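
For concreteness, a minimal sketch of what that "multiversioning faff" looks like today, assuming GCC/Clang on x86-64 Linux (the function is made up for illustration):

    // Today's multiversioning: the compiler emits one clone of the function
    // per listed target plus a "default" fallback, and an ifunc resolver
    // picks a clone at load time. The binary is never rewritten for newer
    // SIMD widths; you only get what was baked in at compile time.
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    float dot(const float *a, const float *b, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];   // auto-vectorized differently in each clone
        return s;
    }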

Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.
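
A rough sketch of the SIMT style, using CUDA as the closest existing example (kernel and launch parameters are illustrative only): the code reads like scalar code, but every instruction runs across a whole warp of lanes.

    // SIMT sketch (CUDA): scalar-looking code, yet each instruction executes
    // across all 32 lanes of a warp in lockstep, so "every instruction is
    // SIMDified" by the execution model rather than by explicit vector ops.
    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one lane per element
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch, sizes picked arbitrarily for the sketch:
    //   saxpy<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);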

aengelke a day ago | parent | next [-]

> if we didn't settle on executing compiled machine code exactly as-is, and had an instruction-updating pass (less involved than a full VM bytecode compilation)

Apple tried something like this: they collected the LLVM bitcode of apps so that they could recompile and even port to a different architecture. To my knowledge, this was done exactly once (watchOS armv7->AArch64) and deprecated afterwards. Retargeting at this level is inherently difficult (different ABIs, target-specific instructions, intrinsics, etc.). For the same target with a larger feature set, the problems are smaller, but so are the gains -- better SIMD usage would only come from the auto-vectorizer and a better instruction selector that uses different instructions. The expected gains, however, are low for typical applications, and for math-heavy programs, using optimized libraries or simply recompiling is easier.

WebAssembly is a higher-level, more portable bytecode, but performance levels are quite a bit behind natively compiled code.

LegionMammal978 a day ago | parent | prev | next [-]

> Another interesting alternative is SIMT. Instead of having a handful of special-case instructions combined with heavyweight software-switched threads, we could have had every instruction SIMDified. It requires structuring programs differently, but getting max performance out of current CPUs already requires SIMD + multicore + predictable branching, so we're doing it anyway, just in a roundabout way.

Is that not where we're already going with the GPGPU trend? The big catch with GPU programming is that many useful routines are irreducibly very branchy (or at least, to an extent that removing branches slows them down unacceptably), and every divergent branch throws out a huge chunk of the GPU's performance. So you retain a traditional CPU to run all your branchy code, but you run into memory-bandwidth woes between the CPU and GPU.
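
To make the divergence cost concrete, a hedged CUDA sketch (the kernel is hypothetical): when lanes of the same warp take different sides of a branch, the warp executes both paths with lanes masked off, so the divergent section pays for both.

    // Divergence sketch (CUDA, hypothetical kernel): lanes whose input is
    // even take path A, the rest take path B. If a warp sees mixed data, it
    // runs path A with the other lanes masked off, then path B with the
    // first group masked off, roughly halving throughput for this section.
    __global__ void divergent(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] % 2 == 0)
            out[i] = in[i] * 3;   // path A
        else
            out[i] = in[i] - 7;   // path B
    }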

It's generally the exception instead of the rule when you have a big block of data elements upfront that can all be handled uniformly with no branching. These usually have to do with graphics, physical simulation, etc., which is why the SIMT model was popularized by GPUs.

pornel 5 hours ago | parent | next [-]

CPUs are not good at branchy code either. Branch mispredictions cause costly pipeline stalls, so you have to make branches either predictable or use conditional moves. Trivially predictable branches are fast — but so are non-diverging warps on GPUs. Conditional moves and masked SIMD work pretty much exactly like on a GPU.
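
To illustrate the parallel, a sketch with AVX intrinsics (assumes x86-64 and -mavx2; the function is made up): masked SIMD on a CPU is the same "compute both, select per lane" pattern as predication on a GPU.

    #include <immintrin.h>

    // Branchless/masked SIMD: instead of branching per element, build a
    // per-lane mask and blend, just like masked execution on a GPU.
    void clamp_negatives_to_zero(float *v, int n) {
        __m256 zero = _mm256_setzero_ps();
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 x    = _mm256_loadu_ps(v + i);
            __m256 mask = _mm256_cmp_ps(x, zero, _CMP_LT_OQ); // lanes where x < 0
            __m256 r    = _mm256_blendv_ps(x, zero, mask);    // pick 0 for those lanes
            _mm256_storeu_ps(v + i, r);
        }
        for (; i < n; i++)            // scalar tail
            if (v[i] < 0.0f) v[i] = 0.0f;
    }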

Even if you have a branchy divide-and-conquer problem ideal for diverging threads, you'll get hit by a relatively high overhead of distributing work across threads, false sharing, and stalls from cache misses.

My hot take is that GPUs will get more features to work better on traditionally CPU-side problems (e.g. the AMD Shader Call proposal, which helps with processing unbalanced tree-structured data), and CPUs will be downgraded to being just a coprocessor for bootstrapping the GPU drivers.

winwang 18 hours ago | parent | prev [-]

Fun fact which I'm 50%(?) sure of: a single branch divergence for integer instructions on current nvidia GPUs won't hurt perf, because there are only 16 int32 lanes anyway.

almostgotcaught 12 hours ago | parent | prev [-]

> There are alternative universes where these wouldn't be a problem

Do people that say these things have literally any experience of merit?

> For example, if we didn't settle on executing compiled machine code exactly as-is, and had an instruction-updating pass

You do understand that at the end of the day, hardware is hard (fixed) and software is soft (malleable), right? There will always be friction at some boundary - it doesn't matter where you hide the rigidity of a literal rock, you eventually reach a point where you cannot reconfigure something that you would like to. And the parts of that rock that are useful are extremely expensive (so no one is adding instruction-updating-pass silicon just because it would be convenient). That's just physics - the rock is very small but fully baked.

> we could have had every instruction SIMDified

Tell me you don't program GPUs without telling me. Not only is SIMT a literal lie today (cf. warp-level primitives), there is absolutely no reason to SIMDify all instructions (and you had better be a wise user of your scalar registers and scalar instructions if you want fast GPU code).
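
For readers who haven't hit this: warp-level primitives are where the "independent threads" abstraction drops away and the warp is handled as an explicit 32-wide SIMD unit. A minimal sketch (CUDA; the helper name is made up):

    // Warp reduction via register shuffles: lanes exchange values directly,
    // which only makes sense if you treat the warp as one SIMD unit.
    // Assumes all 32 lanes are active (full 0xffffffff mask).
    __device__ float warp_sum(float v) {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);  // add value from lane + offset
        return v;  // lane 0 ends up with the warp's total
    }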

I wish people would just realize there's no grand paradigm shift that's coming that will save them from the difficult work of actually learning how the device works in order to be able to use it efficiently.

pornel 4 hours ago | parent [-]

The point of updating the instructions isn't to get optimal behavior in all cases, or to reconfigure programs for wildly different hardware, but to be able to easily target contemporary hardware without having to wait for the oldest hardware to die out before a less outdated baseline can be assumed, and without conditional dispatch.

Users are much more forgiving about software that runs a bit slower than software that doesn't run at all. ~95% of x86_64 CPUs have AVX2 support, but compiling binaries to unconditionally rely on it makes the remaining users complain. If it was merely slower on potato hardware, it'd be an easier tradeoff to make.
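
The workaround in question is runtime dispatch, roughly like this (GCC/Clang builtin; the function names are placeholders):

    // Conditional dispatch: ship both builds of the hot routine and pick one
    // at runtime. Works, but every hot path needs this plumbing, and the
    // baseline build still has to exist.
    void process_avx2(float *data, int n);      // compiled with -mavx2
    void process_baseline(float *data, int n);  // plain x86-64 baseline

    void process(float *data, int n) {
        if (__builtin_cpu_supports("avx2"))
            process_avx2(data, n);
        else
            process_baseline(data, n);
    }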

This is the norm on GPUs thanks to shader recompilation (shaders are far from optimal for all hardware, but at least they get to use the instruction set of the HW they're running on, instead of being limited to the lowest common denominator). On CPUs it happens only in limited cases: Zen 4 added AVX-512 by executing two 256-bit operations serially, and plenty of less critical instructions are emulated in microcode, but that's done by the hardware, because our software isn't set up for it.

Compilers already need to make assumptions about pipeline widths and instruction latencies, so code is tuned for specific CPU vendors/generations anyway, and that tuning doesn't get updated. Less explicitly, optimized code also makes assumptions about cache sizes and compute-vs-memory trade-offs. Code may need an L1 cache of a certain size to work best, but it still runs on CPUs with a too-small L1 cache, just slower. Imagine how annoying it would be if your code couldn't take advantage of a larger L1 cache without crashing on older CPUs. That's where CPUs are with SIMD.
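
A small sketch of that graceful degradation (the 32 KiB L1 figure and block size are assumptions for illustration): a tile size tuned for one cache still works, just slower, on a smaller cache, whereas an instruction from an unsupported SIMD extension simply faults.

    // Cache-blocked transpose tuned for an assumed 32 KiB L1: two 64x64
    // float tiles = 32 KiB. On a CPU with a smaller L1 this just runs
    // slower; code built around an unsupported SIMD extension would
    // instead die with an illegal-instruction fault.
    enum { BLOCK = 64 };

    void transpose(const float *src, float *dst, int n) {
        for (int bi = 0; bi < n; bi += BLOCK)
            for (int bj = 0; bj < n; bj += BLOCK)
                for (int i = bi; i < bi + BLOCK && i < n; i++)
                    for (int j = bj; j < bj + BLOCK && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
    }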

almostgotcaught 3 hours ago | parent [-]

i have no idea what you're saying - i'm well aware that compilers do lots of things but this sentence in your original comment

> compiled machine code exactly as-is, and had an instruction-updating pass

implies there should be silicon that implements the instruction-updating - what else would be "executing" compiled machine code other than the machine itself...........