dzaima 5 months ago

> Open source software is recompiled every week anyway.

Even if it was compiled recently, anything from most Linux package managers, and any precompiled downloadable executable, even one built from open-source code, still targets the 20-year-old SSE2 baseline, wasting the majority of the SIMD resources available on modern (or just not-extremely-ancient) CPUs. The exception is the 0.001% of software that bothers with dynamic dispatch; but for that approach just recompiling isn't enough, since you also need to extend the set of dispatched targets.
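To make the dynamic-dispatch point concrete, here's a rough sketch of what that looks like in C, assuming GCC or Clang on x86 (the `__builtin_cpu_supports` builtin and the `target` attribute). The names `add_baseline`, `add_avx2` and `select_add` are made up for illustration. Note that the set of variants is fixed in the source: a newer compiler alone won't make this use AVX-512, someone has to add that variant and its check.

```c
#include <stddef.h>

typedef void (*add_fn)(float *dst, const float *a, const float *b, size_t n);

/* Built for whatever baseline the compiler flags say (SSE2 on typical x86-64 distros). */
static void add_baseline(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i] + b[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same loop, but this one function may be compiled (and auto-vectorized) with AVX2. */
__attribute__((target("avx2")))
static void add_avx2(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i] + b[i];
}
#endif

/* Hypothetical helper: pick an implementation once, at runtime. */
add_fn select_add(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return add_avx2;
#endif
    /* Anything not listed above (e.g. AVX-512) never gets used, however new the compiler is. */
    return add_baseline;
}
```

A caller would do `add_fn add = select_add();` once and then call `add(dst, a, b, n)` in the hot path.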

RISC-V RVV's LMUL means that you get unrolling for free, as each instruction can operate over up to 8 registers per operand, i.e. essentially "hardware 8x unrolling", thereby making scalar overhead insignificant. (Probably a minor nightmare from the silicon POV, but perhaps not in a particularly limiting way - double-pumping has been done by x86 many times so LMUL=2 is simple enough, and at LMUL=4 and LMUL=8 you can afford to decode/split into uops at 1 instr/cycle.)

ARM SVE can encode adding a multiple of VL in load/store instructions, allowing manual unrolling without having to actually compute the intermediate sizes (hardware-wise that's an extremely tiny amount of overhead, as it's trivially mappable to an immediate offset at decode time). And there's an instruction to bump a variable by a multiple of VL.

And you need to bump pointers in any SIMD regardless; the only difference is whether the bump size is an immediate, or a dynamic value, and if you control the ISA you can balance between the two as necessary. The packed SIMD approach isn't "free" either - you need hardware support for immediate offsets in SIMD load/store instrs.

Even in a hypothetical non-existent bad vector SIMD ISA without any applicable free offsetting in loads/stores and a need for unrolling, you can avoid having a dependency between unrolled iterations by precomputing "vlen*2", "vlen*3", "vlen*4", ... outside of the loop and adding those as necessary, instead of having a strict dependency chain.
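A rough sketch of that last idea in plain C, with the vector length as a runtime value. `lane_sum` is a made-up scalar stand-in for "one vector load plus accumulate", and `sum_unrolled4` is likewise just an illustrative name; the point is only that the four offsets are computed once, so each unrolled address is `base + constant` rather than a chain of `p += vlen` bumps.

```c
#include <stddef.h>

/* Hypothetical stand-in for one vector load + reduce over `vl` lanes starting at p. */
static float lane_sum(const float *p, size_t vl) {
    float s = 0.0f;
    for (size_t i = 0; i < vl; i++) s += p[i];
    return s;
}

/* Sums a[0..n) with 4x unrolling; vlen is only known at runtime. */
float sum_unrolled4(const float *a, size_t n, size_t vlen) {
    if (vlen == 0) vlen = 1; /* avoid a zero step */

    /* Computed once, outside the loop. */
    const size_t off1 = vlen;
    const size_t off2 = vlen * 2;
    const size_t off3 = vlen * 3;
    const size_t step = vlen * 4;

    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; n - i >= step; i += step) {
        const float *base = a + i;
        /* Each address depends only on `base`, not on the previous address. */
        acc0 += lane_sum(base, vlen);
        acc1 += lane_sum(base + off1, vlen);
        acc2 += lane_sum(base + off2, vlen);
        acc3 += lane_sum(base + off3, vlen);
    }

    /* Scalar tail for the leftover elements. */
    float tail = 0.0f;
    for (; i < n; i++) tail += a[i];

    return (acc0 + acc1) + (acc2 + acc3) + tail;
}
```

The only loop-carried dependency on the address side is the single `i += step` per iteration, the same as in a packed-SIMD loop with immediate offsets.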