jcranmer 3 months ago

There's a category of autovectorization known as Superword-Level Parallelism (SLP) which effectively scavenges an entire basic block for individual instruction sequences that might be squeezed together into a SIMD instruction. This kind of vectorization doesn't work well with vector-length-agnostic ISAs, because you generally can't scavenge more than a few elements anyway, and introducing any sort of dynamic vector length is more likely to slow your code down (since you can't constant-fold the width).
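
To make the "scavenging" concrete, this is roughly the shape of code an SLP pass looks for (a minimal sketch in plain C++, function name is mine):

    // Four independent, isomorphic scalar ops on adjacent elements --
    // exactly the pattern SLP can pack into a single SIMD load/add/store.
    void add4(float* __restrict out, const float* a, const float* b) {
        out[0] = a[0] + b[0];
        out[1] = a[1] + b[1];
        out[2] = a[2] + b[2];
        out[3] = a[3] + b[3];
    }

The packable width here is exactly 4 and known at compile time, so asking the hardware "how wide are your vectors?" at runtime adds setup cost without exposing any extra parallelism.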

There are other interesting things you can do with vectors that aren't improved by dynamic-length vectors. Take something like abseil's hash table, which uses vector code to efficiently scan its occupancy metadata. Dynamic vector length doesn't help much there, particularly because the width you can parallelize over is itself intrinsically low (if you're scanning dozens of slots to find an empty one, something is wrong). Vector swizzling is also harder to do dynamically, and at high vector factors it's difficult to do generically in hardware, so even before considering dynamic sizes, going to larger vectors makes vectorization trickier whenever you have to do a lot of swizzling.
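
For reference, the probe in question looks roughly like this (a sketch of the SwissTable-style group match, not abseil's actual code):

    #include <cstdint>
    #include <emmintrin.h>  // SSE2

    // Compare 16 control bytes against the target hash fragment at once
    // and return a bitmask of the slots that match.
    uint32_t match_group(const int8_t* ctrl, int8_t h2) {
        __m128i group  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
        __m128i target = _mm_set1_epi8(h2);
        return static_cast<uint32_t>(
            _mm_movemask_epi8(_mm_cmpeq_epi8(group, target)));
    }

The group is a fixed 16 bytes; there's no long stream of elements for a dynamic vector length to chew through, so the agnosticism buys nothing here.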

In general, vector-length-agnostic is really only good for SIMT-like code, where the vector body can be expressed as a more or less independent f(index) over some knowable-before-you-execute-the-loop range of indices. Stuff like DAXPY, or BLAS in general. Move away from this model, and that agnosticism becomes overhead that doesn't pay for itself. (Now granted, this kind of model is a large fraction of parallelizable code, but it's far from all of it.)
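
DAXPY is the canonical example of that shape (a minimal sketch in C++):

    #include <cstddef>

    // Every iteration is an independent y[i] = a*x[i] + y[i], with no
    // cross-iteration dependence and a trip count known before the loop
    // runs -- exactly the case a vector-length-agnostic ISA can strip-mine
    // at whatever width the hardware happens to provide.
    void daxpy(std::size_t n, double a, const double* x, double* y) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }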