| ▲ | camel-cdr a day ago |
> I prefer fixed width

Do you have examples of problems that are easier to solve in fixed-width SIMD? I maintain that most problems can be solved in a vector-length-agnostic manner. Even if it's slightly more tricky, it's certainly easier than restructuring all of your memory allocations to add padding and implementing three versions for all the differently sized SIMD extensions your target may support.

And you can always fall back to using a variable-width SIMD ISA in a fixed-width way when necessary.
| ▲ | jcranmer a day ago | parent | next [-] |
There's a category of autovectorization known as Superword-Level Parallelism (SLP), which effectively scavenges an entire basic block for individual instruction sequences that might be squeezed together into a SIMD instruction. This kind of vectorization doesn't work well with vector-length-agnostic ISAs, because you generally can't scavenge more than a few elements anyway, and inducing any sort of dynamic vector length is more likely to slow your code down as a result (since you can't do constant folding).

There are other kinds of interesting things you can do with vectors that aren't improved by dynamic-length vectors. Take something like abseil's hash table, which uses vector code to efficiently manage the occupancy bitmap. Dynamic vector length doesn't help much there, particularly because the vector length you can parallelize over is itself intrinsically low (if you're scanning dozens of elements to find an empty slot, something is wrong).

Vector swizzling is harder to do dynamically, and at high vector factors it is difficult to do generically in hardware, which means that as vectors get larger (even before considering dynamic sizes), vectorization gets trickier if you have to do a lot of swizzling.

In general, vector-length-agnostic is really only good for SIMT-like code, where you can express the vector body as a more or less independent f(index) for some knowable-before-you-execute-the-loop range of indices. Stuff like DAXPY or BLAS in general. Move away from this model, and that agnosticism becomes overhead that doesn't pay for itself. (Now granted, this kind of model is a large fraction of parallelizable code, but it's far from all of it.)
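To make the hash-table point concrete, here is a minimal sketch in the style of abseil's flat_hash_map probe (my illustration, not abseil's actual code; the 16-byte group size and the names are assumptions): the group width is baked into the table's memory layout, so a dynamic vector length has nothing to offer.

    #include <emmintrin.h>  // SSE2
    #include <stdint.h>

    // Scan one 16-byte group of control bytes for slots whose 7-bit hash
    // fragment matches h2. Returns a 16-bit mask, one bit per matching slot.
    static inline uint32_t match_group(const int8_t *ctrl, int8_t h2) {
        __m128i group = _mm_loadu_si128((const __m128i *)ctrl);
        __m128i match = _mm_cmpeq_epi8(group, _mm_set1_epi8(h2));
        return (uint32_t)_mm_movemask_epi8(match);
    }

Since a successful probe rarely inspects more than a group or two, there is little parallelism left for a wider (or dynamically sized) vector to exploit.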
| ▲ | camel-cdr a day ago | parent | next [-] |
The SLP vectorizer is a good point, but compared with x86 I think it's more a problem of the float and vector register files not being shared (in SVE and RVV). You don't need to reconfigure the vector length; just use it at the full width.

> Something like abseil's hash table

If I remember this correctly, the abseil lookup does scale with vector length, as long as you use the native data-path width (albeit with small gains). There is a problem with vector-length-agnostic handling of abseil, though, which is the iterator API. With a different API, or compilers that could eliminate redundant predicated load/stores, this would be easier.

> good for SIMT-like codes

Certainly, but I've also seen/written a lot of vector-length-agnostic code using shuffles, which doesn't fit into the SIMT paradigm, so the scope is larger than just SIMT.

---

As a general comparison, take AVX10/128, AVX10/256 and AVX10/512, overlap their instruction encodings, remove the few instructions that no longer make sense, and add a cheap instruction to query the vector length (probably also instructions like vid and viota, for easier shuffle synthesis). Now you have a variable-length SIMD ISA that feels familiar. The above is basically what SVE is.
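For a feel of how familiar that ends up looking, here is a minimal vector-length-agnostic loop using SVE's C intrinsics (a sketch; the function and array names are my own):

    #include <arm_sve.h>
    #include <stdint.h>

    // dst[i] = a[i] + b[i]; the same binary runs unchanged on 128-, 256-,
    // or 512-bit SVE hardware. svcntw() queries the number of 32-bit lanes
    // per vector, and the whilelt predicate masks off the tail iteration.
    void vla_add(float *dst, const float *a, const float *b, int64_t n) {
        for (int64_t i = 0; i < n; i += svcntw()) {
            svbool_t p = svwhilelt_b32_s64(i, n);
            svfloat32_t va = svld1_f32(p, a + i);
            svfloat32_t vb = svld1_f32(p, b + i);
            svst1_f32(p, dst + i, svadd_f32_x(p, va, vb));
        }
    }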
| ▲ | jonstewart 21 hours ago | parent | prev [-] |
A number of the cool string-processing SIMD techniques depend a _lot_ on register widths and instruction performance characteristics. There's a fair argument to be made that x64 could be made more consistent/legible for these use cases, but this isn't matmul: whether you have 128, 256, or 512 bits matters hugely, and you may want entirely different algorithms contingent on this.
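As an illustration of how the algorithm's shape changes with width (my example, not from any particular string library): at 128 bits a byte comparison goes through movemask into a 16-bit scalar mask, while at 512 bits the comparison itself produces a 64-bit mask register, so the surrounding bit-twiddling is structured differently.

    #include <immintrin.h>
    #include <stdint.h>

    // 128-bit variant: compare bytes, extract a 16-bit mask via movemask.
    static inline int count_newlines_128(const char *p) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        __m128i eq = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\n'));
        return __builtin_popcount((unsigned)_mm_movemask_epi8(eq));
    }

    #ifdef __AVX512BW__
    // 512-bit variant: the compare yields a mask register directly, and the
    // "scalar tail" of the algorithm now operates on 64-bit masks.
    static inline int count_newlines_512(const char *p) {
        __m512i chunk = _mm512_loadu_si512((const void *)p);
        __mmask64 eq = _mm512_cmpeq_epi8_mask(chunk, _mm512_set1_epi8('\n'));
        return (int)__builtin_popcountll(eq);
    }
    #endif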
| ▲ | jandrewrogers a day ago | parent | prev | next [-] |
I also prefer fixed width. At least in C++, all of the padding, alignment, etc. is automagically codegen-ed for the register type in my use cases, so the overhead is approximately zero. All the complexity and cost is in specializing for the capabilities of the underlying SIMD ISA, not the width.

The benefit of fixed width is that optimal data structure and algorithm design on various microarchitectures is dependent on explicitly knowing the register width. SIMD widths aren't perfectly substitutable in practice; there is more at play than stride size. You can also do things like explicitly combine separate logic streams in a single SIMD instruction based on knowing the word layout. Compilers don't do this work in 2025.

The argument for vector-width-agnostic code seems predicated on the proverbial "sufficiently advanced compiler". I will likely retire from the industry before such a compiler actually exists. Like fusion power, it has been ten years away my entire life.
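A hypothetical sketch of what "combining separate logic streams" can look like when the width is known (my illustration; the layout and names are assumptions): two independent 4-lane streams share the halves of one 256-bit register, so a single instruction advances both.

    #include <immintrin.h>

    // Lanes 0-3 accumulate stream A, lanes 4-7 accumulate stream B.
    // Committing to this layout only makes sense because the register is
    // known to be exactly 256 bits wide.
    static inline __m256 advance_both(__m256 acc, const float *a, const float *b) {
        __m256 next = _mm256_insertf128_ps(
            _mm256_castps128_ps256(_mm_loadu_ps(a)),  // low half: stream A
            _mm_loadu_ps(b), 1);                      // high half: stream B
        return _mm256_add_ps(acc, next);  // one add progresses both streams
    }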
| ▲ | camel-cdr a day ago | parent [-] |
> The argument for vector-width-agnostic code seems predicated on the proverbial "sufficiently advanced compiler"

A SIMD ISA having a fixed size or not is orthogonal to autovectorization. E.g., I've seen a bunch of cases where things get autovectorized for RVV but not for AVX512. The reason isn't fixed vs. variable, but rather the supported instructions themselves.

There are two things I'd like from a "sufficiently advanced compiler": sizeless struct support and redundant predicated load/store elimination; a sketch of the former follows.
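A minimal example of the sizeless-struct limitation (illustrative; this is the kind of declaration that currently gets rejected when you try to wrap VLA state in an API):

    #include <arm_sve.h>

    // SVE vector types are "sizeless": fine as locals and arguments, but
    // compilers reject them as struct members, which is what makes
    // integrating VLA code into existing struct-based APIs painful.
    struct complex_vec {
        svfloat32_t re;  // error: field has sizeless type 'svfloat32_t'
        svfloat32_t im;
    };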
Neither fundamentally adds new capabilities, but they'd make working with and integrating into existing APIs easier.

> All the complexity and cost is in specializing for the capabilities of the underlying SIMD ISA, not the width.

Wow, it almost sounds like you could take basically the same code and run it with different vector lengths.

> The benefit of fixed width is that optimal data structure and algorithm design on various microarchitectures is dependent on explicitly knowing the register width

Optimal to what degree? Sure, fixed-width SIMD can always turn your pointer increments from a register add into an immediate add, so it's always more "optimal", but that sort of thing doesn't matter. The only difference you usually encounter when writing variable- instead of fixed-size code is that you have to synthesize your shuffles outside the loop. This usually just takes a few instructions, but loading a constant is certainly easier.
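A sketch of that shuffle synthesis with RVV intrinsics (my example; a fixed-width ISA would load a constant index vector instead): the index vector is built once from vid before the loop.

    #include <riscv_vector.h>
    #include <stdint.h>
    #include <stddef.h>

    // Reverse the elements within each full vector-register-sized chunk.
    // The indices (vl-1, vl-2, ..., 0) are synthesized once up front via
    // vid + reverse-subtract; tail handling is omitted for brevity.
    void reverse_chunks(uint32_t *dst, const uint32_t *src, size_t n) {
        size_t vl = __riscv_vsetvlmax_e32m1();
        vuint32m1_t idx =
            __riscv_vrsub_vx_u32m1(__riscv_vid_v_u32m1(vl), vl - 1, vl);
        for (size_t i = 0; i + vl <= n; i += vl) {
            vuint32m1_t v = __riscv_vle32_v_u32m1(src + i, vl);
            __riscv_vse32_v_u32m1(dst + i,
                                  __riscv_vrgather_vv_u32m1(v, idx, vl), vl);
        }
    }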
| ▲ | jandrewrogers 20 hours ago | parent [-] |
The interplay of SIMD width and microarchitecture is more important for performance engineering than you seem to be assuming. Those codegen decisions are made at a layer above anything being talked about here, and they operate on explicit awareness of things like register size. It isn't "same instruction but wider or narrower" or anything that can be trivially autovectorized; it is "different algorithm design". Compilers are not yet rewriting data structures and algorithms based on microarchitecture.

I write a lot of SIMD code, mostly for database engines, little of which is trivial "processing a vector of data types" style code. AVX512 in particular is a strong enough ISA that it is used in all kinds of contexts that we traditionally wouldn't think of as a good fit for SIMD. You can build all kinds of neat quasi-scalar idioms with it, and people do.
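One example of the quasi-scalar flavor (my illustration, not from any particular database engine): a branchless fixed-size key compare, where AVX-512 behaves less like a loop vectorizer and more like a very wide scalar ALU.

    #include <immintrin.h>
    #include <stdbool.h>

    #ifdef __AVX512BW__
    // Compare two 64-byte keys with no loop and no branches. Sizing keys
    // at exactly 64 bytes is a data-structure decision made with the
    // 512-bit register width explicitly in mind.
    static inline bool keys_equal64(const void *a, const void *b) {
        __m512i va = _mm512_loadu_si512(a);
        __m512i vb = _mm512_loadu_si512(b);
        return _mm512_cmpneq_epi8_mask(va, vb) == 0;
    }
    #endif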
| ▲ | synthos 12 hours ago | parent [-] |
> Compilers are not yet rewriting data structures and algorithms based on microarchitecture.

Afaik Mojo allows for this with its autotuning capability and metaprogramming.
| ▲ | pkhuong a day ago | parent | prev [-] |
> Do you have examples of problems that are easier to solve in fixed-width SIMD?

Regular expression matching and encryption come to mind.
| ▲ | camel-cdr a day ago | parent [-] |
> Regular expression matching

That's probably true. Last time I looked at it, it seemed like parts of vectorscan could be vectorized VLA, but from my (very limited) understanding of the main matching algorithm, it does seem to require specialization on the vector length. It should be possible to do VLA in some capacity, but it would probably be slower, and it's too much work to test.

> encryption

From the things I've looked at, it's mixed. E.g., chacha20 and poly1305 vectorize well in a VLA scheme: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/chacha2..., https://camel-cdr.github.io/rvv-bench-results/bpi_f3/poly130...

Keccak, on the other hand, was optimized for fast execution on scalar ISAs with 32 GPRs. This is hard to vectorize in general, because GPR "moves" are free and liberally applied.

Another example where it's probably worth specializing is quicksort, specifically the leaf part. I've written a VLA version, which uses bitonic sort to sort within vector registers. I wasn't able to meaningfully compare it against a fixed-size implementation, because vqsort was super slow when I tried to compile it for RVV.
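To show why chacha20 (mentioned above) maps so cleanly onto VLA code, here is a sketch of its quarter-round with RVV intrinsics (my illustration; state layout and full block handling are elided): each vector lane carries the corresponding state word of a different block, so the number of blocks processed in parallel scales with the hardware vector length.

    #include <riscv_vector.h>

    static inline vuint32m1_t rotl(vuint32m1_t v, unsigned r, size_t vl) {
        return __riscv_vor_vv_u32m1(__riscv_vsll_vx_u32m1(v, r, vl),
                                    __riscv_vsrl_vx_u32m1(v, 32 - r, vl), vl);
    }

    // One ChaCha20 quarter-round; lane i operates on block i's state.
    static inline void quarter_round(vuint32m1_t *a, vuint32m1_t *b,
                                     vuint32m1_t *c, vuint32m1_t *d,
                                     size_t vl) {
        *a = __riscv_vadd_vv_u32m1(*a, *b, vl);
        *d = rotl(__riscv_vxor_vv_u32m1(*d, *a, vl), 16, vl);
        *c = __riscv_vadd_vv_u32m1(*c, *d, vl);
        *b = rotl(__riscv_vxor_vv_u32m1(*b, *c, vl), 12, vl);
        *a = __riscv_vadd_vv_u32m1(*a, *b, vl);
        *d = rotl(__riscv_vxor_vv_u32m1(*d, *a, vl), 8, vl);
        *c = __riscv_vadd_vv_u32m1(*c, *d, vl);
        *b = rotl(__riscv_vxor_vv_u32m1(*b, *c, vl), 7, vl);
    }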