camel-cdr a day ago
> we somewhat maxed out at 512 bits

Which still means you have to write your code at least thrice, which is two times more than with a variable-length SIMD ISA.

Also there are processors with larger vector lengths, e.g. 1024-bit: Andes AX45MPV, SiFive X380; 2048-bit: Akeana 1200; 16384-bit: NEC SX-Aurora, Ara, EPI.

> no way around this

You rarely need to rewrite SIMD code to take advantage of new extensions, unless somebody decides to create a new one with a larger SIMD width. This mostly happens when very specialized instructions are added.

> In my experience, dynamic vector sizes make code slower, because they inhibit optimizations.

Do you have more examples of this? I don't see spilling as much of a problem, because you want to avoid it regardless, and codegen for dynamic vector sizes is pretty good in my experience.

> I don't think SVE delivered any large benefits

Well, all Arm CPUs except for the A64FX were built to execute NEON as fast as possible. X86 CPUs aren't built to execute MMX or SSE, or even the latest AVX, as fast as possible.

Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve

> Performance was a lot better than I expected, giving between 14 and 63% uplift. Larger block sizes benefitted the most, as we get higher utilization of the wider vectors and fewer idle lanes.

> I found the scale of the uplift somewhat surprising as Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so in terms of data-width the two should work out very similar.
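For illustration, the appeal of a variable-length ISA is that the strip-mining pattern is written once against a run-time vector length. A plain-C sketch of the idea (with `vl` standing in for the hardware-reported vector length; no real intrinsics are used):

```c
#include <stddef.h>

/* Sketch of the vector-length-agnostic pattern used by RVV/SVE code:
 * `vl` stands in for the vector length the hardware reports at run time,
 * so the same source serves 128-bit and 16384-bit implementations alike. */
void saxpy_vla(size_t n, float a, const float *x, float *y, size_t vl) {
    for (size_t i = 0; i < n; ) {
        size_t step = (n - i < vl) ? n - i : vl; /* tail folds into the loop */
        for (size_t j = 0; j < step; j++)        /* one "vector" operation */
            y[i + j] += a * x[i + j];
        i += step;
    }
}
```

The same function handles any width and any tail length, which is the "write it once" argument.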
aengelke a day ago
> Also there are processors with larger vector length

How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.

> Do you have more examples of this?

I ported some numeric simulation kernel to the A64FX some time ago; fixing the vector width gave a 2x improvement. Compilers probably/hopefully have gotten better in the meantime and I haven't redone the experiments since then, but I'd be surprised if this changed drastically. Spilling is sometimes unavoidable, e.g. due to function calls.

> Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve

I was specifically referring to dynamic vector sizes. This experiment uses sizes fixed at compile time; from the article:

> For the astcenc implementation of SVE I decided to implement a fixed-width 256-bit implementation, where the vector length is known at compile time.
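The fixed-width win is plausible at the codegen level: once the vector length is a compile-time constant, inner trip counts are known, loops can be fully unrolled, and temporaries can be allocated to registers statically. A hypothetical C sketch of the contrast (`VL` and the kernel are illustrative, not the actual A64FX code):

```c
#include <stddef.h>

#define VL 16 /* floats per vector: 512-bit SVE on A64FX (illustrative) */

/* With VL fixed at compile time the inner loop has a known trip count,
 * so the compiler can fully unroll it and keep temporaries in registers;
 * with a run-time vl it must keep the loop structure and may spill.
 * For brevity this sketch assumes n is a multiple of VL. */
void saxpy_fixed(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i += VL)
        for (size_t j = 0; j < VL; j++)
            y[i + j] += a * x[i + j];
}
```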
vardump a day ago
> Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.

256 and 512 bits are the only reasonable widths. 256-bit AVX2 is what, 13 or 14 years old now.
codedokode 12 hours ago
> Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.

This is the wrong approach. You should be writing your code in a high-level language like this:
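(The code block here did not survive extraction. As a stand-in for the kind of thing meant — not the original snippet — here is a plain-loop formulation that leaves vectorization entirely to the compiler:)

```c
#include <stddef.h>

/* High-level formulation: state the computation, not the registers.
 * Built with e.g. `-O3 -march=native`, the compiler is free to emit
 * SSE, AVX2, AVX-512, NEON, SVE, or RVV code from this one source. */
void scale_add(size_t n, float a,
               const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```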
And let the compiler write the assembly for every existing architecture (including a multi-threaded version of the loop).

I don't understand what the advantage of writing the SIMD code manually is. At least have an LLM write it if you don't like my imaginary high-level vector language.