Remix.run Logo
jcranmer 5 months ago

There's a category of autovectorization known as Superword-Level Parallelism (SLP) which effectively scavenges an entire basic block for individual instruction sequences that might be squeezed together into a SIMD instruction. This kind of vectorization doesn't work well with vector-length-agnostic ISAs, because you generally can't scavenge more than a few elements anyways, and inducing any sort of dynamic vector length is more likely to slow your code down as a result (since you can't do constant folding).

There's other kinds of interesting things you can do with vectors that aren't improved by dynamic-length vectors. Something like abseil's hash table, which uses vector code to efficiently manage the occupancy bitmap. Dynamic vector length doesn't help that much in that case, particularly because the vector length you can parallelize over is itself intrinsically low (if you're scanning dozens of elements to find an empty slot, something is wrong). Vector swizzling is harder to do dynamically, and in general, at high vector factors, difficult to do generically in hardware, which means going to larger vectors (even before considering dynamic sizes), vectorization is trickier if you have to do a lot of swizzling.

In general, vector-length-agnostic is really only good for SIMT-like codes, which you can express the vector body as more or less independent f(index) for some knowable-before-you-execute-the-loop range of indices. Stuff like DAXPY or BLAS in general. Move away from this model, and that agnosticism becomes overhead that doesn't pay for itself. (Now granted, this kind of model is a large fraction of parallelizable code, but it's far from all of it).

jonstewart 5 months ago | parent | next [-]

A number of the cool string processing SIMD techniques depend a _lot_ on register widths and instruction performance characteristics. There’s a fair argument to be made that x64 could be made more consistent/legible for these use cases, but this isn’t matmul—whether you have 128, 256, or 512 bits matters hugely and you may want entirely different algorithms that are contingent on this.

camel-cdr 5 months ago | parent | prev | next [-]

The SLP vectorizer is a good point, but I think it's, in comparison with x86, more a problem of the float and vector register files not being shared (in SVE and RVV). You don't need to reconfigure the vector length; just use it at the full width.

> Something like abseil's hash table

If I remember this correctly, the abseil lookup does scale with vector length, as long as you use the native data path width. (albeit with small gains) There is a problem with vector length agnostic handling of abseil, which is the iterator API. With a different API, or compilers that could eliminate redundant predicated load/stores, this would be easier.

> good for SIMT-like codes

Certainly, but I've also seen/written a lot of vector length agnostic code using shuffles, which don't fit into the SIMT paradigm, which means that the scope is larger than just SIMT.

---

As a general comparison, take AVX10/128, AVX10/256 and AVX10/512, overlap their instruction encodings, remove the few instructions that don't make sense anymore, and add a cheap instruction to query the vector length. (probably also instructions like vid and viota, for easier shuffle synthesization) Now you have a variable-length SIMD ISA that feels familiar.

The above is basically what SVE is.

janwas 5 months ago | parent [-]

(For other readers:) This is what our Highway library does - wrapper functions around intrinsics, plus a (constexpr if possible) Lanes() function to query the length.

For very many cases, writing the code once for an 'unknown to the programmer' vector length indeed works.

One example that doesn't work so well is a sorting network; its size depends on the vector length. (I see you mention this below.)

camel-cdr 5 months ago | parent [-]

I quite like highway.

As mentioned, last time I tried vqsort for RVV it was surprisingly slow.

I tried to replicate it yesterday, but noticed that vqsort is now disabled for RVV: https://github.com/google/highway/blob/400fbf20f2e40b984be12...

Does highway support sorting networks for non-128-bit vector registers?

When I tried to compile it for AVX512, the BaseCase seems to only use xmm registers: https://godbolt.org/z/qr9xoTGKn

janwas 5 months ago | parent [-]

:) Yes, vqsort recently tickled a bug in clang. I've seen a steady stream of issues, many caused by SLP or the seeming absence of CI. You might try re-enabling it on GCC.

Yes, the issue with the sorting network is that it is limited to 16x16 to reduce code explosion. With uint16_t, XMM are sufficient for the 8-column case; your Godbolt link does have some YMM for the 16-column case. When changing the type to sort to uint32_t, we see ZMM as expected.

camel-cdr 5 months ago | parent [-]

Btw, here is a VLA vector register sort: https://godbolt.org/z/Env64961q

It has a few more instructions then the VLS version, but the critical dependency chain is the same.

It's also slightly less optimal on x86, because it alway uses lane crossing permutes. For AVX512 that is 5 out of 15 permutations that are vperm, but could've been vshuf. (if the loop isn't unrolled and optimized by the compiler)

I wasn't able to figure out how to implement the multi vector register sort in a VLA way.

janwas 5 months ago | parent [-]

Nice work :) Clang x86 indeed unrolls, which is good. But setting the CC and AA mask constants looks fairly expensive compared to fixed-pattern shuffles.

Yes, the 2D aspect of the sorting network complicates things. Transposing is already harder to make VLA and fusing it with the other shuffles certainly doesn't help.

janwas 5 months ago | parent | prev [-]

I advised the Abseil design and regret not pointing this out earlier: changing the interface to insert/query batches of items would be considerably more efficient, especially for databases. Whenever possible, 'vertical' algorithms (independent SIMD elements) usually scale better than 'horizontal' (pick one element within a vector).