ashvardanian 5 days ago
I like that more people are getting involved with SIMD, and there have been several posts lately on both memmem-like and memcpy-like operations implemented in SIMD in different programming languages. In most cases, though, these still focus on AVX/NEON instructions from over 10 years ago, rather than the newer and more powerful AVX-512 variants, SVE & SVE2, or RVV. These newer ISAs can noticeably change how one would implement a state-of-the-art substring search or copy/move operation. In my projects, such as StringZilla, I often use mask K registers (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...) and an input-dependent mix of temporal and non-temporal loads and stores (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...). In typical cases, the throughput gap between the suggested SIMD kernels and the state of the art can be as large as 50%. As SIMD becomes more widespread, it would be beneficial to focus more on delivering software and bundling binaries, rather than just the kernels.
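To illustrate the mask K register point, here is a minimal sketch of how an AVX-512 masked load can replace the usual scalar tail loop. It is not taken from StringZilla; `count_byte_masked` is a hypothetical helper written for this example, and the AVX-512 path is only compiled when the build enables AVX-512BW (for example with `-mavx512bw`), otherwise a plain scalar loop is used.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512BW__)
#include <immintrin.h>
#endif

/* Hypothetical helper: counts how many bytes in data[0..length) equal needle. */
size_t count_byte_masked(const char *data, size_t length, char needle) {
    size_t count = 0;
#if defined(__AVX512BW__)
    const __m512i needles = _mm512_set1_epi8(needle);

    /* Full 64-byte blocks: one unaligned load and one compare-to-mask each. */
    while (length >= 64) {
        __m512i chunk = _mm512_loadu_si512((const void *)data);
        __mmask64 hits = _mm512_cmpeq_epi8_mask(chunk, needles);
        count += (size_t)_mm_popcnt_u64(hits);
        data += 64;
        length -= 64;
    }

    /* Tail of 1..63 bytes: a K-register mask selects only the valid lanes.
       Masked-out lanes are neither read nor allowed to fault, so no scalar
       tail loop and no over-read past the end of the buffer. */
    if (length != 0) {
        __mmask64 tail = (__mmask64)((1ULL << length) - 1ULL);
        __m512i chunk = _mm512_maskz_loadu_epi8(tail, (const void *)data);
        /* The compare is masked too: zeroed lanes must not match needle == 0. */
        __mmask64 hits = _mm512_mask_cmpeq_epi8_mask(tail, chunk, needles);
        count += (size_t)_mm_popcnt_u64(hits);
    }
#else
    /* Portable fallback when the build does not target AVX-512BW. */
    for (size_t i = 0; i < length; ++i)
        count += (size_t)(data[i] == needle);
#endif
    return count;
}
```

For example, `count_byte_masked("hello world", 11, 'o')` returns 2 on either path. With AVX2 or NEON the same tail typically needs a scalar loop or an overlapping load, which is one place where the newer ISAs change the shape of the kernel.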
ack_complete 5 days ago
Sure, but I have to support a range of target CPUs in the consumer desktop market, and the older CPUs are the ones that need optimizations the most. That means NEON on ARM64 and AVX2 or SSE2-4 on x64. Time spent on higher vector instruction sets benefits a smaller fraction of the user base that already has better performance, and that's especially problematic if the algorithm has to be reworked to take best advantage of the higher extensions. AVX-512 is also in bad shape market-wise, despite its amazing feature set and how long it's been since its initial release. The Steam Hardware Survey, which skews toward the higher end of the market, only shows 18% of the user base having AVX-512 support. And even that number holds up despite Intel's best efforts to reverse progress by shipping all new consumer CPUs with AVX-512 support disabled.
moregrist 5 days ago
I’m not as familiar with the NEON side, but AVX512 support is pretty variable on new processors. Alder Lake omits it entirely. So we’re still in a world where AVX2 is the lowest common denominator for a system library that wants wide support.
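One common way to live with that variability is runtime dispatch: ship several kernels in one binary and pick one at startup. The sketch below assumes GCC or Clang on x86 (it relies on their `__builtin_cpu_supports` builtin and the `target` function attribute) and falls back to scalar code everywhere else. All function names here are hypothetical, and the lazy initialization is not synchronized, so a real library would want an atomic or a one-time initializer.

```c
#include <stddef.h>

typedef size_t (*count_fn)(const char *, size_t, char);

/* Baseline kernel that runs on any CPU. */
static size_t count_scalar(const char *data, size_t length, char needle) {
    size_t count = 0;
    for (size_t i = 0; i < length; ++i)
        count += (size_t)(data[i] == needle);
    return count;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1

/* AVX2 kernel, compiled for AVX2 even if the rest of the file is not. */
__attribute__((target("avx2")))
static size_t count_avx2(const char *data, size_t length, char needle) {
    const __m256i needles = _mm256_set1_epi8(needle);
    size_t count = 0, i = 0;
    for (; i + 32 <= length; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(data + i));
        unsigned hits = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needles));
        count += (size_t)__builtin_popcount(hits);
    }
    for (; i < length; ++i) /* scalar tail, since AVX2 has no byte-masked loads */
        count += (size_t)(data[i] == needle);
    return count;
}
#endif

/* Picks the best kernel the current CPU supports. */
static count_fn resolve_count(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return count_avx2;
#endif
    return count_scalar;
}

/* Public entry point: resolves once, then calls through the pointer. */
size_t count_byte_dispatch(const char *data, size_t length, char needle) {
    static count_fn impl = NULL;
    if (impl == NULL)
        impl = resolve_count();
    return impl(data, length, needle);
}
```

An AVX-512 or SVE kernel would slot in as one more branch in `resolve_count`, so the users who have the hardware get it without raising the baseline for everyone else.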
ashvardanian 5 days ago
PS: Finding CPUs that support AVX-512 and SVE is relatively trivial - practically every cloud has them by now. It's harder to find Arm CPUs with wide physical registers, but that's another story.
nromiun 5 days ago
Because it is very hard to find new hardware to test on, let alone expect your users to take advantage of it on their machines. AVX512 is such a mess that Intel just removed it after a generation or two. And on the ARM SVE side it is even worse. There is already SVE2, but good luck finding even an SVE-enabled machine. Apple does not support it on their Apple Silicon™ (only SME), and Snapdragon does not support it even on their latest 8 Elite. 8 Elite Gen 2 is supposed to come with it. Only MediaTek and Neoverse chips support it. So finding even one machine to develop and test such code on can be a little difficult.