Remix.run Logo
camel-cdr a day ago

BLAS, specifically gemm, is one of the rare things where you naturally need to specialize on vector register width.

Most problems don't require this: E.g. your basic penalizable math stuff, unicode conversion, base64 de/encode, json parsing, set intersection, quicksort*, bigint, run length encoding, chacha20, ...

And if you run into a problem that benefits from knowing the SIMD width, then just specialize on it. You can totally use variable-length SIMD ISA's in a fixed-length way when required. But most of the time it isn't required, and you have code that easily scales between vector lengths.

*quicksort: most time is spent partitioning, which is vector length agnostic, you can handle the leafs in a vector length agnostic way, but you'll get more efficient code if you specialize (idk how big the impact is, in vreg bitonic sort is quite efficient).