SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side. RISC-V did the same thing with RVV, for better or worse.

▲

camel-cdr 4 hours ago | parent | next [-]

> SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side.

You can treat both SVE and RVV as a regular fixed-width SIMD ISA.

"Runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV or SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128 bits. They also have good predication support (like AVX-512), which allows them to mask off elements after a certain point.

If you want to emulate AVX2 with SVE or RVV, you might require that the hardware has a native vector length >=256 bits, and then always mask off the bits beyond 256, so the same code works on any native vector length >=256 bits.
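A minimal sketch of that masking idea in C, assuming the Arm C Language Extensions for SVE (`arm_sve.h`). The helper name `add_u8x32` is hypothetical and only for illustration. The predicate pattern `SV_VL32` activates exactly the first 32 byte lanes (256 bits) regardless of the native vector length; on hardware with shorter vectors that pattern yields an all-false predicate, which is why the sketch checks `svcntb()` first and otherwise falls back to a scalar loop (the same fallback is used when not compiling for SVE at all).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: add two 32-byte (256-bit) blocks, AVX2-style. */
void add_u8x32(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
#if defined(__ARM_FEATURE_SVE)
    /* svcntb() is the native vector length in bytes. */
    if (svcntb() >= 32) {
        /* Only the first 32 byte lanes are active; lanes beyond
           256 bits are masked off on wider hardware. */
        svbool_t pg = svptrue_pat_b8(SV_VL32);
        svuint8_t va = svld1_u8(pg, a);
        svuint8_t vb = svld1_u8(pg, b);
        svst1_u8(pg, dst, svadd_u8_x(pg, va, vb));
        return;
    }
#endif
    /* Scalar fallback: no SVE, or native vectors shorter than 256 bits. */
    for (size_t i = 0; i < 32; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

The predicated loads and stores touch only 32 bytes, so the function behaves the same on 256-bit, 512-bit, or wider implementations.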

▲

jsheard 3 hours ago | parent [-]

> You can treat both SVE and RVV as a regular fixed-width SIMD ISA.

Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time.

ARM seems to be proposing a C language extension which does require compilers to support variably sized types, but it's not clear to me how the implementation of that is going, and equivalent support in other languages like Rust seems basically non-existent for now.

▲

camel-cdr 3 hours ago | parent | next [-]

> Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time

Yes, you can't, which is annoying, but you can if you compile for a specific vector length.

This is mostly a library structure problem. E.g. simdjson has a generic backend that assumes a fixed vector length. I've written fixed-width RVV support for it. A vector-length-agnostic backend is also possible, but requires writing a full new backend. I'm planning to write it in the future (I already have a few json::minify implementations), but it will be more work. If the generic backend used a SIMD abstraction that supports scalable vectors, like Highway, this wouldn't be a problem.

Toolchain support should also be improved, e.g. you could make all vregs take 512 bits on the stack, but have the codegen only utilize the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs, and 512 bits if you have >=512-bit vregs.

▲

jsheard 3 hours ago | parent [-]

> Toolchain support should also be improved, e.g. you could make all vregs take 512 bits on the stack, but have the codegen only utilize the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs, and 512 bits if you have >=512-bit vregs.

SVE theoretically supports hardware up to 2048-bit, so conservatively reserving the worst-case size at compile time would be pretty wasteful. That's 16x overhead in the base case of 128-bit hardware.
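The arithmetic behind the two positions can be sketched in plain C. The function names here are hypothetical and only model the trade-off: a fixed stack slot per vector register, with codegen using at most the slot size.

```c
/* Bits of a fixed-size stack slot that codegen would actually use:
   the native register width, capped at the slot size. */
unsigned used_bits(unsigned native_bits, unsigned slot_bits)
{
    return native_bits < slot_bits ? native_bits : slot_bits;
}

/* How many times larger the reserved slot is than the part in use. */
unsigned stack_overhead(unsigned slot_bits, unsigned native_bits)
{
    return slot_bits / used_bits(native_bits, slot_bits);
}
```

With these definitions, `stack_overhead(2048, 128)` is 16 (reserving the SVE architectural maximum on 128-bit hardware), while `stack_overhead(512, 128)` is 4 (the capped 512-bit slot proposed in the parent comment).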

▲

pertymcpert 33 minutes ago | parent | prev [-]

You can definitely put SVE vectors on the stack; there are special instructions to load and store with variable offsets. What you can't do is put them into structs, which need concretely sized member types (i.e. each subsequent member needs to have a known byte offset).

▲

Tuldok 5 hours ago | parent | prev | next [-]

The only time I've encountered ARM SVE being used in the wild is in the FEX x86 emulator (https://fex-emu.com/FEX-2407/).

▲

kbolino 5 hours ago | parent | prev | next [-]

Yeah, the extensions exist, and as pointed out by a sibling comment to yours, have been implemented in supercomputer cores made by Fujitsu. However, as far as I know, neither Apple nor Qualcomm has made any desktop cores with SVE support. So the biggest reason there's no desktop software for it is that there's no hardware support.

▲

jsheard 5 hours ago | parent | next [-]

ARM's Neoverse IP does support SVE, so it's at least already relevant in cloud applications. Apparently AWS Graviton3 had 256-bit SVE, but Graviton4 regressed back to 128-bit SVE for some reason?

https://ashvardanian.com/posts/aws-graviton-checksums-on-neo...

▲

camel-cdr 4 hours ago | parent [-]

The problem with SVE is that ARM vendors need to make NEON as fast as possible to stay competitive, so there is little incentive to implement SVE with wider vectors. Graviton3 has 256-bit SVE vector registers but only four 128-bit SIMD execution units, because NEON needs to be fast. Intel previously was in such a dominant market position that they could require all performance-critical software to be rewritten thrice.

▲

my123 4 hours ago | parent | prev | next [-]

The Oryon 3rd gen in the Snapdragon X2 has SVE2 (as does NVIDIA N1x, currently pre-launched of sorts on the DGX Spark)

▲

justincormack 4 hours ago | parent | prev [-]

I think the CIX P1 has support, but I haven't got one yet to verify. It's a cheap SoC.

▲

otherjason 4 hours ago | parent | prev | next [-]

The only CPU I've encountered that supports SVE is the Cortex-X925/A725 that is used in the NVIDIA DGX Spark platform. The vector width is still only 128 bits, but you do get access to the other enhancements the SVE instructions give, like predication (one of the most useful features from Intel's AVX-512).

▲

0x000xca0xfe 4 hours ago | parent | prev [-]

RISC-V chip designers at least seem to be more bullish on vectors. There is seriously cool stuff coming like the SpacemiT K3 with 1024-bit vectors :)

▲

camel-cdr 4 hours ago | parent | next [-]

The 1024-bit RVV cores in the K3 are mostly that size to feed a matmul engine. While the vector registers are 1024-bit, the two execution units are only 256 bits wide.

The main cores in the K3 have 256-bit vectors with two 128-bit wide execution units, and two separate 128-bit wide vector load/store units.

But yes, RVV already has more diverse vector width hardware than SVE.

▲

0x000xca0xfe 2 hours ago | parent [-]

It's a low-clocked (2.1 GHz) dual-issue in-order core, so obviously nowhere near the real-world performance of e.g. Zen 5, which can retire multiple 256-bit or even 512-bit vector instructions per cycle at 5+ GHz. But I find the RVV ISA just really fascinating. Grouping eight 1024-bit registers together gives us 8192-bit or 1-kilobyte registers! That's a tremendous amount of work that can be done using a single instruction. Feels like the Lanz Bulldog of CPUs. Not sure how practical it will be after all, but it's certainly interesting.
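The register-grouping numbers can be checked with a small C sketch. The function names are hypothetical; the formula is the RVV definition VLMAX = LMUL * VLEN / SEW, restricted here to integer LMUL for simplicity.

```c
/* Size in bytes of one register group: LMUL registers of VLEN bits each. */
unsigned group_bytes(unsigned vlen_bits, unsigned lmul)
{
    return vlen_bits * lmul / 8;
}

/* Maximum elements one instruction can process:
   VLMAX = LMUL * VLEN / SEW (integer LMUL only in this sketch). */
unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul)
{
    return vlen_bits / sew_bits * lmul;
}
```

For VLEN = 1024 and LMUL = 8, `group_bytes(1024, 8)` is 1024 bytes, and a single instruction can cover up to `vlmax(1024, 8, 8)` = 1024 byte elements or `vlmax(1024, 32, 8)` = 256 32-bit elements.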
