camel-cdr a day ago

> we somewhat maxed out at 512 bits

Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.
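
For reference, "write it once" with a variable-length ISA looks roughly like this dot product in RVV intrinsics (a sketch of mine, not from any of the sources above; names per the RVV intrinsics spec):

    #include <riscv_vector.h>

    // One loop for any hardware vector length: vsetvl returns how many
    // elements this iteration handles, so there is no separate tail loop.
    float dot(const float *a, const float *b, size_t n) {
        size_t vlmax = __riscv_vsetvlmax_e32m1();
        vfloat32m1_t acc = __riscv_vfmv_v_f_f32m1(0.0f, vlmax);
        for (size_t vl; n > 0; n -= vl, a += vl, b += vl) {
            vl = __riscv_vsetvl_e32m1(n);
            vfloat32m1_t va = __riscv_vle32_v_f32m1(a, vl);
            vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl);
            // _tu (tail-undisturbed) keeps accumulator lanes past vl intact
            acc = __riscv_vfmacc_vv_f32m1_tu(acc, va, vb, vl);
        }
        vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, vlmax);
        return __riscv_vfmv_f_s_f32m1_f32(
            __riscv_vfredusum_vs_f32m1_f32m1(acc, zero, vlmax));
    }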

Also, there are processors with larger vector lengths, e.g. 1024-bit: Andes AX45MPV, SiFive X380; 2048-bit: Akeana 1200; 16384-bit: NEC SX-Aurora, Ara, EPI.

> no way around this

You rarely need to rewrite SIMD code to take advantage of new extensions, unless somebody decides to create a new one with a larger SIMD width. Rewriting is mostly needed when very specialized instructions are added.

> In my experience, dynamic vector sizes make code slower, because they inhibit optimizations.

Do you have more examples of this?

I don't see spilling as much of a problem, because you want to avoid it regardless, and codegen for dynamic vector sizes is pretty good in my experience.

> I don't think SVE delivered any large benefits

Well, all Arm CPUs except for the A64FX were built to execute NEON as fast as possible. x86 CPUs aren't built to execute MMX or SSE, or even the latest AVX, as fast as possible.

Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve

> Performance was a lot better than I expected, giving between 14 and 63% uplift. Larger block sizes benefitted the most, as we get higher utilization of the wider vectors and fewer idle lanes.

> I found the scale of the uplift somewhat surprising as Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so in terms of data-width the two should work out very similar.

aengelke a day ago | parent | next [-]

> Also there are processors with larger vector length

How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.

> Do you have more examples of this?

I ported a numeric simulation kernel to the A64FX some time ago; fixing the vector width gave a 2x improvement. Compilers probably/hopefully have gotten better in the meantime and I haven't redone the experiments since then, but I'd be surprised if this changed drastically. Spilling is sometimes unavoidable, e.g. due to function calls.
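
Fixing the width amounts to something like the following (a generic illustration of mine, not the actual kernel): with GCC/Clang, -msve-vector-bits pins the SVE vector length at compile time, so svcntd() folds to a constant and the loop can be fully unrolled and scheduled:

    // gcc -O3 -march=armv8.2-a+sve -msve-vector-bits=512  (A64FX width)
    #include <arm_sve.h>

    // a[i] += s * b[i]; with the width pinned, svcntd() becomes the
    // constant 8 and the whilelt predicates resolve at compile time.
    void axpy(double *a, const double *b, double s, int64_t n) {
        for (int64_t i = 0; i < n; i += svcntd()) {
            svbool_t pg = svwhilelt_b64_s64(i, n);
            svfloat64_t va = svld1_f64(pg, &a[i]);
            svfloat64_t vb = svld1_f64(pg, &b[i]);
            svst1_f64(pg, &a[i], svmla_n_f64_x(pg, va, vb, s));
        }
    }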

> Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve

I was specifically referring to dynamic vector sizes. This experiment uses sizes fixed at compile-time, from the article:

> For the astcenc implementation of SVE I decided to implement a fixed-width 256-bit implementation, where the vector length is known at compile time.

camel-cdr a day ago | parent [-]

> How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.

The NEC is an attached accelerator, but IIRC it can run an OS in host mode. It's hard to tell how the others perform, because most either don't have hardware available yet or restrict access to themselves and partner companies. It's also hard to compare because they don't target the desktop market.

> I ported some numeric simulation kernel to the A64Fx some time ago, fixing the vector width gave a 2x improvement.

Oh, wow. Was this autovectorized or handwritten intrinsics/assembly?

Any chance it's of a small enough scope that I could try to recreate it?

> I was specifically referring to dynamic vector sizes.

Ah, sorry, yes you are correct. It still shows that supporting VLA mechanisms in an ISA doesn't mean it's slower for fixed-size usage.

I'm not aware of any proper VLA vs VLS comparisons. I once benchmarked a VLA vs VLS mandelbrot implementation and saw no performance difference, but that's too simple an example.

vardump a day ago | parent | prev | next [-]

> Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.

256 and 512 bits are the only reasonable widths. 256-bit AVX2 is what, 13 or 14 years old now?

adgjlsfhk1 a day ago | parent [-]

No. Because Intel is full of absolute idiots, Intel Atom didn't support AVX1 until Gracemont. Tremont is missing AVX1, AVX2, FMA, and basically the rest of x86-64-v3, and shipped in CPUs as recently as 2021 (Jasper Lake).
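
So in practice you can't assume a baseline and have to dispatch at runtime, e.g. with GCC/Clang's __builtin_cpu_supports (a sketch; the kernel names are placeholders):

    void kernel_avx2(void);   // compiled with -mavx2 in its own TU
    void kernel_sse2(void);   // SSE2 is guaranteed by the x86-64 baseline

    void run(void) {
        // Checks CPUID at runtime, so one binary still works
        // on the AVX-less Atoms and Pentiums.
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2();
        else
            kernel_sse2();
    }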

ack_complete 18 hours ago | parent | next [-]

Intel also shipped a bunch of Pentium-branded CPUs that have AVX disabled, leading to oddities like a Kaby Lake-based CPU that doesn't have AVX. Even worse, they also shipped a few CPUs that have AVX2 but not BMI2:

https://sourceware.org/bugzilla/show_bug.cgi?id=29611

https://developercommunity.visualstudio.com/t/Crash-in-Windo...

vardump a day ago | parent | prev [-]

Oh damn. I dropped SSE ages ago and no one complained. I guess the customer base didn't use those chips...

codedokode 12 hours ago | parent | prev [-]

> Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.

This is the wrong approach. You should be writing your code in a high-level language, like this:

    x = sum i for 1..n: a[i] * b[i]
And let the compiler write the assembly for every existing architecture (including a multi-threaded version of the loop).
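
The closest thing today is a plain scalar loop plus the autovectorizer, e.g. in C:

    // gcc/clang -O3 (plus -ffast-math, since vectorizing an FP
    // reduction reorders the additions) will emit SSE/AVX/NEON/SVE
    // code matching the -march target from this scalar loop.
    float dot(const float *a, const float *b, size_t n) {
        float x = 0.0f;
        for (size_t i = 0; i < n; i++)
            x += a[i] * b[i];
        return x;
    }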

I don't understand the advantage of writing SIMD code manually. At least have an LLM write it if you don't like my imaginary high-level vector language.

otherjason 6 hours ago | parent [-]

This is the common argument from proponents of compiler autovectorization. An example like yours is very simple, so modern compilers will turn it into SIMD code without a problem.

In practice, though, the cases that compilers can successfully autovectorize are very limited relative to the total problem space that SIMD is solving. Plus, if I rely on that, it leaves me vulnerable to regressions in the compiler vectorizer.
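
A classic example (mine, for illustration): a loop-carried dependence, like a prefix sum, defeats straightforward autovectorization even though a profitable hand-written SIMD version exists:

    // Each iteration depends on the previous one, so the vectorizer
    // gives up; a hand-written log2(width)-step shift-and-add scan
    // in SIMD registers can still beat this scalar loop.
    void prefix_sum(float *a, size_t n) {
        for (size_t i = 1; i < n; i++)
            a[i] += a[i - 1];
    }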

Ultimately for me, I would rather write the implementation myself and know what is being generated versus trying to write high-level code in just the right way to make the compiler generate what I want.