Prefix sums at gigabytes per second with ARM NEON(lemire.me)
53 points by mfiguiere 5 days ago | 7 comments
hayley-patton an hour ago | parent | next [-]

The article doesn't mention it, but if you want the general form of this algorithm, it is a Hillis-Steele prefix sum: <https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorte...>
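For anyone who hasn't seen it, here is a minimal scalar sketch of the Hillis-Steele pattern in Python. It is only an illustration of the idea, not the article's NEON code; the function name `hillis_steele_scan` is made up for this example. Each round adds in the value `shift` positions to the left, and `shift` doubles every round, so there are about log2(n) rounds.

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum using the Hillis-Steele shift-and-add pattern."""
    out = list(values)
    n = len(out)
    shift = 1
    while shift < n:
        # Every position i >= shift adds the value `shift` places to its left.
        # Reading from the previous round's list models all lanes updating at once.
        prev = out
        out = [prev[i] + prev[i - shift] if i >= shift else prev[i]
               for i in range(n)]
        shift *= 2
    return out
```

In a SIMD version each round would be one vector shift plus one vector add, which is why the pattern maps well onto fixed-width registers.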

vardump 6 hours ago | parent | prev [-]

What's going on with SVE[2] support in the ARM land? It's weird that even Apple's M5 still doesn't support it (other than SME[2]).

adrian_b 2 hours ago | parent | next [-]

All the ARM cores designed by the Arm company and launched since 2022, which support variants of the Armv9-A ISA, support SVE2. This means that all medium price or high price smartphones that were introduced during the last 4 years have SVE2 support.

However, in embedded computers typically only extremely old cores are used, not newer than Cortex-A78 (2021), so these normally do not have SVE2 support. The exceptions are the new and extremely expensive NVIDIA Thor, intended for automotive applications (with Neoverse V3AE cores) and a CPU made by a Chinese company with Cortex-A720 cores, which is available in several single-board computers or Mini-ITX motherboards.

A few of the latest Arm-based server CPUs, for instance AWS Graviton5, support SVE2.

Apple seems to believe that the SVE2 ISA (derived from an extension of AArch64 from Fujitsu) is not good, so they promote the SME/SME2 extension, which appears to be derived from a former proprietary ISA extension implemented in older Apple CPUs.

For single-thread applications, where Apple CPUs are better than the competition, SME2 can provide significantly higher performance than SVE2.

However, SME2 performance for multi-threaded applications is much less impressive. This is not because the SME2 ISA has any defect, but because SME2 is executed in a separate dedicated core that is shared by a cluster of normal CPU cores. Since a CPU might have only 1 SME2 core for each 4 or 8 normal cores, SME2 performance does not scale much when more cores are used.

This might contribute to the fact that the Apple CPUs have exceptional single-thread performance, but a multi-threaded performance that is not better than that of the competitors.

When I first heard about SVE/SVE2, I thought that it was great, but nowadays I am much less enthusiastic about it. I believe that the original goal, of writing programs that run on any CPU, regardless of the widths of its vector registers and of its vector execution units, is futile.

It is not possible to reach the maximum performance allowed by the hardware in a width-agnostic program. So now I believe that what is needed is not hardware support for ignoring the width, but better software tools that make it easier to write programs parametrized with hardware characteristics like the width of a cache line and the width of vector or matrix registers, from which a compiler should generate optimal code when the hardware parameters are given.
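As a toy illustration of what "parametrized with the width" could mean, here is a rough Python sketch of a prefix sum where the lane count is an explicit parameter instead of something the code tries to ignore. The function name `blocked_prefix_sum` and the `lanes` parameter are hypothetical, chosen only for this example; a real tool would specialize the kernel at compile time for the given width.

```python
def blocked_prefix_sum(values, lanes=4):
    """Inclusive prefix sum processed in blocks of `lanes` elements.

    `lanes` stands in for the vector register width in elements
    (a hypothetical hardware parameter supplied by the caller).
    """
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    out = []
    carry = 0  # running total of all previous blocks
    for start in range(0, len(values), lanes):
        block = list(values[start:start + lanes])
        # In-block scan: about log2(lanes) shift-and-add rounds.
        shift = 1
        while shift < len(block):
            block = [block[i] + block[i - shift] if i >= shift else block[i]
                     for i in range(len(block))]
            shift *= 2
        # Add the total of the previous blocks to every lane of this block.
        block = [x + carry for x in block]
        carry = block[-1]
        out.extend(block)
    return out
```

The result is the same for any `lanes`, but the amount of work per block and the number of shift-and-add rounds depend on it, which is the kind of decision a width-aware tool would make from the hardware parameters.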

Even if I believe that SVE2 is not good enough to allow the programmer to ignore the implemented width, it still has some important improvements over the older Arm SIMD instructions, so it must be preferred on any CPU that supports it. When SME2 is available, like on Apple or on the latest generation of Arm cores launched in 2025, it is likely to be preferable to SVE2, unless latency is more important than throughput.

SME2 is intended to offer better throughput than SVE2 and better latency than the GPU. For maximum throughput, the GPU is preferable, if applicable. For minimum latency, SVE2 is the best.

my123 27 minutes ago | parent | next [-]

SME2 is restricted in scope to matrix multiply workloads and isn't really designed for anything else.

The point of streaming SVE is to have a way to pre/post process data on the way in or out of a matrix multiply.

A list of chips that I have around which support various levels of SVE:

For SVE(1) deployment, chips that have it:

- Fujitsu A64fx
- AWS Graviton3

SVE2:

- Snapdragon X2, 8/8 Elite Gen 5 and later
- MediaTek Dimensity 9000 and later
- NVIDIA Tegra Thor and later, NVIDIA "N1" or later (GB10 is an "N1x" SKU)
- Samsung Exynos 2200 or later
- AWS Graviton4, Microsoft Cobalt 100, Google Axion (and newer chips)
- CIX P1

SME(1) instead of SME2:

- Snapdragon X2, 8/8 Elite Gen 5

SME2:

- Apple M4, A18 and later
- Samsung Exynos 2600
- MediaTek Dimensity 9500

Note that the Snapdragon 8/8 Elite Gen 5 and X2 support sve2 but not svebitperm.

dzaima an hour ago | parent | prev [-]

> This means that all medium price or high price smartphones that were introduced during the last 4 years have SVE2 support.

Except Qualcomm chipsets, which disable SVE even if all ARM cores used support it. ("Snapdragon 8 Elite Gen 5" supposedly finally supports SVE? but that's like only half a year old)

my123 35 minutes ago | parent [-]

Qualcomm was odd like that for a long time yeah.

And yes the Gen 5 chips (8, 8 Elite and X2) do implement SVE2 and SME.

nubinetwork 4 hours ago | parent | prev [-]

The Radxa Orion O6 apparently supports it...