camel-cdr 5 months ago
Btw, here is a VLA vector register sort: https://godbolt.org/z/Env64961q It has a few more instructions than the VLS version, but the critical dependency chain is the same. It's also slightly less optimal on x86, because it always uses lane-crossing permutes. For AVX512, 5 of the 15 permutations are vperm where vshuf would have been enough (at least if the compiler doesn't unroll and optimize the loop). I wasn't able to figure out how to implement the multi vector register sort in a VLA way.
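For readers who want the general shape of a vector-length-agnostic register sort, here is a minimal scalar sketch in C. It is not the code behind the godbolt link: it assumes a bitonic network (the linked code may use a different one), models the register as a plain array, and the names `MAX_LANES`, `vla_layer` and `vla_bitonic_sort` are made up for illustration. The point it tries to show is that each layer derives its permute indices and min/max select mask from the lane index instead of using fixed per-width constants, which is also why every layer ends up as a general permute rather than a fixed-pattern shuffle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on lanes per register, only for the scratch copy. */
enum { MAX_LANES = 64 };

/* One compare-exchange layer of a bitonic network.
 * Lane i is paired with lane i ^ j; k is the size of the current merge block.
 * On real hardware the first loop would be one general permute and the
 * second loop a min, a max and a masked select. */
static void vla_layer(uint32_t *v, size_t n, size_t j, size_t k)
{
    uint32_t partner[MAX_LANES];

    /* "Permute": indices are computed from the lane number, not hard-coded. */
    for (size_t i = 0; i < n; i++)
        partner[i] = v[i ^ j];

    for (size_t i = 0; i < n; i++) {
        uint32_t lo = v[i] < partner[i] ? v[i] : partner[i];
        uint32_t hi = v[i] < partner[i] ? partner[i] : v[i];
        /* Select mask, also derived from the lane number: the lower lane of a
         * pair keeps the minimum in ascending blocks, the upper lane keeps it
         * in descending blocks. */
        int take_min = ((i & j) == 0) == ((i & k) == 0);
        v[i] = take_min ? lo : hi;
    }
}

/* Sorts n lanes in ascending order; n must be a power of two <= MAX_LANES. */
void vla_bitonic_sort(uint32_t *v, size_t n)
{
    assert(n >= 1 && n <= MAX_LANES && (n & (n - 1)) == 0);
    for (size_t k = 2; k <= n; k *= 2)
        for (size_t j = k / 2; j > 0; j /= 2)
            vla_layer(v, n, j, k);
}
```

The same two loops run unchanged for any power-of-two lane count, which is the VLA property; a VLS version could instead hard-code the cheaper in-lane shuffle for layers where `j` stays within a lane group.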
janwas 5 months ago | parent
Nice work :) Clang x86 indeed unrolls, which is good. But setting the CC and AA mask constants looks fairly expensive compared to fixed-pattern shuffles. Yes, the 2D aspect of the sorting network complicates things. Transposing is already harder to make VLA, and fusing it with the other shuffles certainly doesn't help.