geokon 6 hours ago

On a high level, do I understand correctly that SIMD is close to how the hardware works, while Vector Processor is more of an abstraction? The "Strip Mining" part looks like this translation to something SIMD-like. It seems like a good abstraction layer, but there is an implicit compilation step, right? (making the "assembly" more easily run on different actual hardware)

Someone 6 hours ago | parent [-]

> On a high level, do I understand correctly that SIMD is close to how the hardware works, while Vector Processor is more of an abstraction?

Not quite. It is still the same “process whatever number of items you can in parallel, decrease count by that, repeat if necessary” loop.

RISC-V decided to move the “decrease count by that, repeat if necessary” part into hardware, making the entire phrase “how the hardware works”.

Makes for shorter and nicer assembly. SIMD without it first has to query the CPU to find out how much parallelization it can handle (once) and do the “decrease count by that, repeat if necessary” part on the main CPU.

dzaima 4 hours ago | parent [-]

RVV still very much requires you to write a manual code/assembly loop doing the "compute how many elements can be handled, decrease count by that, repeat if necessary" thing. All it does is make it take slightly fewer instructions to do so (and also allows handling a loop's tail in the same loop while at it).
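
For readers who want to see the shape of that loop, here is a minimal scalar sketch in C. It is not RVV code: `HW_MAX_ELEMS` and `set_vl` are hypothetical stand-ins for the hardware vector length and for what `vsetvli` reports, simplified to a plain minimum, and the inner `for` stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hardware vector length, in elements.
   On real RVV this is a property of the CPU, not a compile-time constant. */
#define HW_MAX_ELEMS 8

/* Simplified emulation of vsetvli: how many elements this pass handles. */
static size_t set_vl(size_t remaining) {
    return remaining < HW_MAX_ELEMS ? remaining : HW_MAX_ELEMS;
}

/* dst[i] = a[i] + b[i] for i in [0, n), written as a strip-mined loop. */
void add_arrays(int *dst, const int *a, const int *b, size_t n) {
    while (n > 0) {
        size_t vl = set_vl(n);           /* compute how many elements fit */
        assert(vl > 0 && vl <= n);
        for (size_t i = 0; i < vl; i++)  /* stands in for one vector add */
            dst[i] = a[i] + b[i];
        a += vl;                         /* advance all three pointers */
        b += vl;
        dst += vl;
        n -= vl;                         /* decrease count by that, repeat */
    }
}
```

The tail needs no separate loop: on the last pass `set_vl` simply returns fewer elements than the maximum, which is the "handling a loop's tail in the same loop" part.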

Joker_vD 4 hours ago | parent [-]

Yeah, except you don't need to rewrite that code every time a new AVX drops, and also don't need to bother to figure out what to do on older CPUs.

IIRC libc for x64 has several implementations of memcpy/memmove/strlen/etc. for different SSE/AVX extensions, which all get compiled in and shipped to your system; when libc is loaded for the first time, it figures out what is the latest extension the CPU it's running on actually supports and then patches its exports to point to the fastest working implementations.
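
A rough sketch of that "pick once, then call the fast one" idea in portable C. glibc does this with GNU indirect functions (IFUNC), resolved by the dynamic loader; the version below swaps in a plain function pointer resolved on first call instead. Everything named here (`cpu_level`, `detect_cpu`, `my_strlen`, the three variants) is hypothetical, and detection is stubbed so the example runs anywhere.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical feature levels; a real libc queries the CPU (e.g. CPUID). */
enum cpu_level { CPU_BASELINE, CPU_SSE2, CPU_AVX2 };

/* Assumption: stubbed detection, always reporting SSE2. */
static enum cpu_level detect_cpu(void) { return CPU_SSE2; }

/* Placeholder variants; real ones would use different instruction sets. */
static size_t strlen_baseline(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}
static size_t strlen_sse2(const char *s) { return strlen(s); }
static size_t strlen_avx2(const char *s) { return strlen(s); }

typedef size_t (*strlen_fn)(const char *);

static size_t strlen_resolve(const char *s);

/* Starts out pointing at the resolver, which replaces it on first call. */
static strlen_fn strlen_impl = strlen_resolve;

static size_t strlen_resolve(const char *s) {
    switch (detect_cpu()) {
    case CPU_AVX2: strlen_impl = strlen_avx2;     break;
    case CPU_SSE2: strlen_impl = strlen_sse2;     break;
    default:       strlen_impl = strlen_baseline; break;
    }
    return strlen_impl(s);  /* forward the first call to the chosen variant */
}

/* Public entry point: after the first call this is one indirect call. */
size_t my_strlen(const char *s) { return strlen_impl(s); }
```

Each new extension worth exploiting means another variant and another `case`, which is the maintenance cost being discussed here.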

dzaima 3 hours ago | parent [-]

You don't need to write a new loop every time a new vector size drops, but over time you'll still get more and more cases of wanting to write multiple copies of loops to take advantage of new instructions; there are already a good bunch of extensions of RVV (e.g. Zvbb has a good couple that are encounterable in general-purpose code), and many more to come (e.g. if we ever get vrgathers that don't scale quadratically with LMUL, some mask ops, and who knows what else will be understood as obviously-good-to-have in the future).

This kinda (though admittedly not entirely) balances out the x86 problem - sure, you have to write a new loop to take advantage of wider vector registers, but you often want to do that anyway - on SSE→AVX(2) you get to take advantage of non-destructive ops, all inline loads being unaligned, and a couple new nice instrs; on AVX2→AVX512 you get a ton of masking stuff, non-awful blends, among others.

RVV gets an advantage here largely due to just simply being a newer ISA, at a time where it is actually reasonably possible for even baseline hardware to support expensive compute instrs, complex shuffles, all unaligned mem ops (..though, actually, with RISC-V/RVV not mandating unaligned support (and allowing it to be extremely-slow even when supported) this is another thing you may want to write multiple loops for), and whatnot; whereas x86 SSE2 had to work on whatever could exist 20 years ago, and as such made respective compromises.

In some edge-cases the x86 approach can even be better - if you have some code that benefits from having different versions depending on hardware vector size (e.g. needs to use vrgather, or processes some fixed-size data that'd be really bad to write in a scalable way), on RVV you may end up needing to write a loop for each combination of VLEN and extension-set (i.e. a quadratic number of cases), whereas on x86 you only need to have a version of the loop for each desired extension-set.