aengelke a day ago

I agree; and the article seems to have also quite a few technical flaws:

- Register width: we somewhat maxed out at 512 bits, with Intel going back to 256 bits for non-server CPUs. I don't see larger widths on the horizon (even if SVE theoretically supports up to 2048 bits, I don't know any implementation with ~~>256~~ >512 bits). Larger bit widths are not beneficial for most applications and the few applications that are (e.g., some HPC codes) are nowadays served by GPUs.

- The post mentions available opcode space: while opcode space is limited, a reasonably well-designed ISA (e.g., AArch64) has enough holes for extensions. Adding new instructions doesn't require ABI changes, and while adding new registers requires some kernel changes, this is well understood at this point.

- "What is worse, software developers often have to target several SIMD generations" -- no way around this, though, unless auto-vectorization becomes substantially better. Adjusting the register width is not the big problem when porting code, making better use of available instructions is.

- "The packed SIMD paradigm is that there is a 1:1 mapping between the register width and the execution unit width" -- no. E.g., AMD's Zen 4 does double pumping, and AVX was IIRC originally designed to support this as well (although Intel went directly for 256-bit units).

- "At the same time many SIMD operations are pipelined and require several clock cycles to complete" -- well, they are pipelined, but many SIMD instructions have the same latency as their scalar counterpart.

- "Consequently, loops have to be unrolled in order to avoid stalls and keep the pipeline busy." -- loop unroll has several benefits, mostly to reduce the overhead of the loop and to avoid data dependencies between loop iterations. Larger basic blocks are better for hardware as every branch, even if predicted correctly, has a small penalty. "Loop unrolling also increases register pressure" -- it does, but code that really requires >32 registers is extremely rare, so a good instruction scheduler in the compiler can avoid spilling.

In my experience, dynamic vector sizes make code slower, because they inhibit optimizations. E.g., spilling a dynamically sized vector is like a dynamic stack allocation with a dynamic offset. I don't think SVE delivered any large benefits, either in terms of performance (there's not much hardware with SVE to begin with...) or compiler support. RISC-V pushes further in this direction; we'll see how that turns out.
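
To make "dynamically sized" concrete, here is a sketch with the ACLE SVE intrinsics (compile with -march=armv8-a+sve): svfloat32_t is a sizeless type, so sizeof() is ill-formed, it can't be a struct member, and any spill has to be addressed in multiples of the runtime vector length.

    #include <arm_sve.h>
    #include <stdint.h>

    /* VLA-style sum: the vector length is only known at run time. */
    float sum_sve(const float *a, int64_t n) {
        svfloat32_t acc = svdup_n_f32(0.0f);
        for (int64_t i = 0; i < n; i += svcntw()) {   /* elements per vector */
            svbool_t pg = svwhilelt_b32_s64(i, n);    /* tail predicate */
            acc = svadd_f32_m(pg, acc, svld1_f32(pg, a + i));
        }
        return svaddv_f32(svptrue_b32(), acc);
    }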

camel-cdr a day ago | parent | next [-]

> we somewhat maxed out at 512 bits

Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.

Also, there are processors with larger vector lengths, e.g. 1024-bit: Andes AX45MPV, SiFive X380; 2048-bit: Akeana 1200; 16384-bit: NEC SX-Aurora, Ara, EPI.

> no way around this

You rarely need to rewrite SIMD code to take advantage of new extensions, unless somebody decides to create a new one with a larger SIMD width. This mostly happens when very specialized instructions are added.

> In my experience, dynamic vector sizes make code slower, because they inhibit optimizations.

Do you have more examples of this?

I don't see spilling as much of a problem, because you want to avoid it regardless, and codegen for dynamic vector sizes is pretty good in my experience.

> I don't think SVE delivered any large benefits

Well, all Arm CPUs except for the A64FX were built to execute NEON as fast as possible. x86 CPUs aren't built to execute MMX or SSE, or even AVX on the latest ones, as fast as possible.

Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve

> Performance was a lot better than I expected, giving between 14 and 63% uplift. Larger block sizes benefitted the most, as we get higher utilization of the wider vectors and fewer idle lanes.

> I found the scale of the uplift somewhat surprising as Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so in terms of data-width the two should work out very similar.

aengelke a day ago | parent | next [-]

> Also there are processors with larger vector length

How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.

> Do you have more examples of this?

I ported some numeric simulation kernel to the A64FX some time ago; fixing the vector width gave a 2x improvement. Compilers probably/hopefully have gotten better in the meantime and I haven't redone the experiments since then, but I'd be surprised if this changed drastically. Spilling is sometimes unavoidable, e.g. due to function calls.

> Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve

I was specifically referring to dynamic vector sizes. This experiment uses sizes fixed at compile-time, from the article:

> For the astcenc implementation of SVE I decided to implement a fixed-width 256-bit implementation, where the vector length is known at compile time.
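
In source terms, that fixed width corresponds roughly to the ACLE vector-bits attribute; a sketch (not the actual astcenc code), compiled with -msve-vector-bits=256:

    #include <arm_sve.h>

    /* With the width pinned at 256 bits, the type behaves like an ordinary
       32-byte vector: sizeof() works, it can live in structs, and spills
       land at constant stack offsets. */
    typedef svfloat32_t vec256 __attribute__((arm_sve_vector_bits(256)));

    vec256 scale(vec256 v, float s) {
        return svmul_n_f32_x(svptrue_b32(), v, s);
    }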

camel-cdr a day ago | parent [-]

> How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.

The NEC is an attached accelerator, but IIRC it can run an OS in host mode. It's hard to tell how the others perform, because most don't have hardware available yet or only they and partner companies have access. It's also hard to compare, because they don't target the desktop market.

> I ported some numeric simulation kernel to the A64Fx some time ago, fixing the vector width gave a 2x improvement.

Oh, wow. Was this autovectorized or handwritten intrinsics/assembly?

Any chance it's of a small enough scope that I could try to recreate it?

> I was specifically referring to dynamic vector sizes.

Ah, sorry, yes you are correct. It still shows that supporting VLA mechanisms in an ISA doesn't mean it's slower for fixed-size usage.

I'm not aware of any proper VLA vs VLS comparisons. I benchmarked a VLA vs VLS mandelbrot implementation once where there was no performance difference, but that's too simple an example.

vardump a day ago | parent | prev | next [-]

> Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.

256 and 512 bits are the only reasonable widths. 256 bit AVX2 is what, 13 or 14 years old now.

adgjlsfhk1 a day ago | parent [-]

No. Because Intel is full of absolute idiots, Intel Atom didn't support AVX1 until Gracemont. Tremont is missing AVX1, AVX2, FMA, and basically the rest of x86-64-v3, and shipped in CPUs as recently as 2021 (Jasper Lake).

ack_complete 17 hours ago | parent | next [-]

Intel also shipped a bunch of Pentium-branded CPUs that have AVX disabled, leading to oddities like a Kaby Lake based CPU without AVX. Even worse, they also shipped a few CPUs that have AVX2 but not BMI2:

https://sourceware.org/bugzilla/show_bug.cgi?id=29611

https://developercommunity.visualstudio.com/t/Crash-in-Windo...

vardump a day ago | parent | prev [-]

Oh damn. I dropped SSE ages ago and no one complained. I guess the customer base didn't use those chips...

codedokode 11 hours ago | parent | prev [-]

> Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.

This is the wrong approach. You should be writing your code in a high-level language like this:

    x = sum i for 1..n: a[i] * b[i]
And let the compiler write the assembly for every existing architecture (including a multi-threaded version of the loop).
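
In today's terms that's roughly the following plain C, which GCC and Clang will auto-vectorize at -O3 (with -ffast-math to allow reassociating the reduction):

    #include <stddef.h>

    float dot(const float *a, const float *b, size_t n) {
        float x = 0.0f;
        for (size_t i = 0; i < n; i++)
            x += a[i] * b[i];   /* vectorized by the compiler, no intrinsics */
        return x;
    }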

I don't understand what the advantage of writing the SIMD code manually is. At least have an LLM write it if you don't like my imaginary high-level vector language.

otherjason 5 hours ago | parent [-]

This is the common argument from proponents of compiler autovectorization. An example like what you have is very simple, so modern compilers would turn it into SIMD code without a problem.

In practice, though, the cases that compilers can successfully autovectorize are very limited relative to the total problem space that SIMD is solving. Plus, if I rely on that, it leaves me vulnerable to regressions in the compiler vectorizer.

Ultimately for me, I would rather write the implementation myself and know what is being generated versus trying to write high-level code in just the right way to make the compiler generate what I want.
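
For a kernel as simple as the dot product above, "writing it myself" would look roughly like this sketch with AVX2/FMA intrinsics (assuming n is a multiple of 8; compile with -mavx2 -mfma):

    #include <immintrin.h>
    #include <stddef.h>

    float dot_avx2(const float *a, const float *b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);   /* acc += va * vb */
        }
        /* horizontal sum of the 8 lanes */
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
        return _mm_cvtss_f32(s);
    }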

bjourne 17 hours ago | parent | prev | next [-]

> "Loop unrolling also increases register pressure" -- it does, but code that really requires >32 registers is extremely rare, so a good instruction scheduler in the compiler can avoid spilling.

No, it actually is super common in HPC code. If you unroll a loop N times, you need N times as many registers. For normal memory-bound code I agree with you, but most HPC kernels will exploit as much of the register file as they can for blocking/tiling.
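
For example, a register-blocked micro-kernel (a sketch with AVX2/FMA; A is assumed to be packed into 4-row panels) keeps a whole tile of accumulators live across the unrolled k-loop, and that tile plus the loads already occupies most of the 16 ymm registers:

    #include <immintrin.h>
    #include <stddef.h>

    /* 4x16 tile: 8 accumulator registers stay live for the whole k-loop,
       plus 2 registers for B and 1 for the broadcast of A. */
    void kernel_4x16(const float *A, const float *B, float *C,
                     size_t k, size_t ldb, size_t ldc) {
        __m256 c[4][2];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 2; j++)
                c[i][j] = _mm256_setzero_ps();

        for (size_t p = 0; p < k; p++) {
            __m256 b0 = _mm256_loadu_ps(B + p * ldb);
            __m256 b1 = _mm256_loadu_ps(B + p * ldb + 8);
            for (int i = 0; i < 4; i++) {
                __m256 a = _mm256_broadcast_ss(A + p * 4 + i);
                c[i][0] = _mm256_fmadd_ps(a, b0, c[i][0]);
                c[i][1] = _mm256_fmadd_ps(a, b1, c[i][1]);
            }
        }
        for (int i = 0; i < 4; i++) {
            _mm256_storeu_ps(C + i * ldc,     c[i][0]);
            _mm256_storeu_ps(C + i * ldc + 8, c[i][1]);
        }
    }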

xphos 15 hours ago | parent | prev | next [-]

I think the variable length stuff does solve encoding issues, and RISC-V takes big strides with the ideas around chaining and the vl/lmul/vtype registers.

I think they would benefit from having 4 vtype registers, though. It's wasted scalar space, but how often do you actually rotate between 4 different vector types in main loop bodies? The answer is pretty rarely, and you'd greatly reduce the swapping between vtypes. I think they needed to find 1 more bit, but it's tough; the encoding space isn't that large for RVV, which is a perk for sure.
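
For illustration, a sketch with the RVV C intrinsics (rv64gcv): as soon as a loop body mixes element widths, every switch costs another vsetvli, which is exactly the churn extra vtype registers would cut down.

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Widen bytes and add them into a 32-bit array: each iteration toggles
       vtype between e8/m1 and e32/m4, so the compiler emits a vsetvli for
       every switch. */
    void add_bytes(uint32_t *acc, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n;) {
            size_t vl = __riscv_vsetvl_e8m1(n - i);
            vuint8m1_t  b = __riscv_vle8_v_u8m1(src + i, vl);     /* e8,m1  */
            vuint32m4_t w = __riscv_vzext_vf4_u32m4(b, vl);       /* e32,m4 */
            vuint32m4_t a = __riscv_vle32_v_u32m4(acc + i, vl);
            __riscv_vse32_v_u32m4(acc + i, __riscv_vadd_vv_u32m4(a, w, vl), vl);
            i += vl;
        }
    }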

Can't wait to see more implementations of RVV to actually test some of its ideas.

dzaima 15 hours ago | parent [-]

If you had two extra bits in the instruction encoding, I think it'd make much more sense to encode the element width directly in instructions, leaving the LMUL multiplier & agnosticness settings in vsetvl; the only things that'd suffer then would be needing tail-undisturbed for one instruction (don't think that's particularly common) and fancy things that reinterpret the vector between different element widths (very uncommon).

Will be interesting to see if longer encodings for RVV with encoded vtype or whatever ever materialize.

cherryteastain a day ago | parent | prev | next [-]

The Fujitsu A64FX used in the Fugaku supercomputer uses SVE with a 512-bit vector width.

aengelke a day ago | parent [-]

Thanks, I misremembered. However, the microarchitecture is a bit "weird" (really HPC-targeted), with very long latencies (e.g., ADD (vector) 4 cycles, FADD (vector) 9 cycles). I remember that it was much slower than older x86 CPUs for non-SIMD code, and even for SIMD code it took quite a bit of effort to get reasonable performance through instruction-level parallelism, due to the long latencies and the very limited out-of-order capacity (in particular, just the 2x20 reservation station entries for FP).

deaddodo a day ago | parent | prev [-]

> - Register width: we somewhat maxed out at 512 bits, with Intel going back to 256 bits for non-server CPUs. I don't see larger widths on the horizon (even if SVE theoretically supports up to 2048 bits, I don't know any implementation with >256 bits). Larger bit widths are not beneficial for most applications and the few applications that are (e.g., some HPC codes) are nowadays served by GPUs.

Just to address this, it's pretty evident why scalar values have stabilized at 64 bits and vectors at ~512 (though there are larger implementations). Tell someone they only have 256 values to work with and they immediately see the limit; it's why old 8-bit code wasted so much time shuffling carries to compute larger values. Tell them they have 65536 values and it alleviates a large part of that problem, but you're still going to hit limits frequently. Now you have up to 4294967296 values, and the limits are realistically only going to be hit in computational realms, so bump it up to 18446744073709551615. Now even most commodity computational limits are alleviated, and the compiler will handle the data shuffling for larger ones.

There was naturally going to be a point where there was enough static computational power on integers that it didn't make sense to continue widening them (at least, not at the previous rate). The same goes for vectorization, but in even more niche and specific fields.