stevefan1999 7 hours ago
As someone who used std::simd in an attempt to submit to an academic conference CFP*, I have looked deeply into how std::simd works, and I would conclude that there are a couple of reasons it isn't stable yet (this is rather long and may need 10 minutes to read):

1. It depends heavily on LLVM intrinsics, which themselves can change quite a lot. Sometimes an intrinsic would even fail to instantiate and crash the entire compilation. I, for example, met chronic ICE crashes for the same code on different nightly Rust versions. Then I realized it was because the SIMD operation was too complicated and I needed to simplify it, and sometimes to stop recursing and expanding so much, to prevent stack spilling and exhausting register allocation. This happens from time to time, especially when using std::simd on embedded targets where registers are scarce.

2. Some hardware design decisions make SIMD itself unergonomic and hard to generalize, and this is reflected in the design of std::simd as well. Recall that SIMD techniques stem from the vector processors in supercomputers from the likes of Cray and IBM; that is from the 70s, and back then computation and hardware design were primitive and simple, so they had fixed vector sizes. The ancient design is very stable and is kept to this day, even in the likes of AVX2, AVX-512, VFP and NEON, and this influenced the design of things like lane count (https://doc.rust-lang.org/std/simd/struct.LaneCount.html). But here's the plot twist: as time goes on, it turns out that modern SIMD is capable of variable vector sizes; RISC-V's vector extension is one such implementation. So now we face a dilemma: keep the existing fixed lane count design, or allow it to extend further. If we extend it to cater for things like variable vector lengths, then we need to wait for generic_const_exprs to be stable, and right now it is not only unstable but incomplete too (https://github.com/rust-lang/portable-simd/issues/416). This is a hard philosophical design change and is not easy to deal with. Time will tell. (The first sketch below shows what the fixed-lane design looks like in practice.)

3. As an extension to #2, thinking in SIMD is hard in the very first place, and to use it in production you even have to think about different hardware situations. This comes in the form of dynamic dispatch, and it is a pain to deal with; we have great helpers such as multiversion (second sketch below), but it is still very hard to design an interface that scales. Take Google's highway (https://github.com/google/highway/blob/master/g3doc/quick_re...) for example: it is the library for writing portable SIMD code with dynamic dispatch in C++, but it does so in an esoteric and not-so-ergonomic way. Whether we could do better with std::simd is still an open question. How do you abstract the idea of a scatter-gather operation? What the heck is a swizzle? Why do we call it shuffle and not permutation? Lots of stuff to learn, and that means lots of pain to go through.
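To make #2 concrete, here is a minimal sketch (mine, not from the std docs) of what lane-generic code looks like under today's fixed-lane design, on nightly Rust with the portable_simd feature. The lane count must be a compile-time constant; there is no way to say "whatever width the hardware picks at run time", which is exactly where variable-length vectors don't fit:

    #![feature(portable_simd)]
    use std::simd::prelude::*;
    use std::simd::{LaneCount, SupportedLaneCount};

    // N is restricted to the compile-time-supported lane counts.
    fn sum_u32<const N: usize>(chunks: &[Simd<u32, N>]) -> u32
    where
        LaneCount<N>: SupportedLaneCount,
    {
        chunks
            .iter()
            .copied()
            .fold(Simd::splat(0), |acc, v| acc + v)
            .reduce_sum()
    }

    fn main() {
        let chunks = [Simd::<u32, 8>::splat(1); 4]; // 4 vectors of 8 lanes each
        assert_eq!(sum_u32(&chunks), 32);
    }

And for the dynamic dispatch pain in #3, a sketch with the multiversion crate (assuming the 0.7 attribute syntax; the target strings are just examples). One clone is compiled per target plus a fallback, and the first call picks one via CPU feature detection:

    use multiversion::multiversion;

    #[multiversion(targets("x86_64+avx2+fma", "x86_64+sse4.1", "aarch64+neon"))]
    fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
        // Plain code; each clone is autovectorized for its own target.
        for (xi, yi) in x.iter().zip(y.iter_mut()) {
            *yi = a * xi + *yi;
        }
    }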
4. Plus, when you think in SIMD, there can be multiple instructions and multiple ways to do the same thing, and one may be more efficient than another. For example, since I had to touch some finite field stuff in GF(2^8), there are a few ways to do finite field multiplication:

a. Precomputed table lookups.

b. Russian peasant multiplication (basically carryless shift-and-add multiplication; it often reduces to a form of table lookups as well, and can also be seen as a ripple counter with modular arithmetic, except carries have to be delivered in a different way). A scalar sketch is in the PS at the end of this comment.

c. An inner product followed by Barrett reduction (https://www.esat.kuleuven.be/cosic/publications/article-1115...).

d. Or just treat it as multiplication over a polynomial power series, but this essentially means we treat it as a finite field convolution, which I suspect is highly related to the Fourier transform (https://arxiv.org/pdf/1102.4772).

e. Use the somewhat new GF2P8AFFINEQB (https://www.felixcloutier.com/x86/gf2p8affineqb) from GFNI. Contrary to the common belief that it is available for AVX-512 only, it is actually available for SSE/AVX/AVX2 as well (gcc calls this GFNI-SSE), so it works on my 13600KF too (except that I obviously cannot use ZMM registers, or I just get an illegal instruction for anything that touches ZMM or uses the EVEX encoding). I have an internal implementation of finite field multiplication using just that, but I need the polynomial 0x11D rather than 0x11B, so GF2P8MULB (https://www.felixcloutier.com/x86/gf2p8mulb) is out of the question (it is supposed to be the fastest in the world, theoretically, if only it could use an arbitrary polynomial); this is rather hard to understand and explain in the first place. (By the way, I used SIMDe for that: https://github.com/simd-everywhere/simde. See also the PS.)

All of these can be done in SIMD, but each of these methods has its pros and cons. Table lookup may be fast and seemingly O(1), but you actually need to keep the table in cache, meaning we trade space for time, and SIMD amplifies the cache thrashing from multiple accesses. This can slow down the CPU pipeline, although modern CPUs are clever enough about cache management. If you want to do Russian peasant multiplication, you need a bunch of loops to go through the shifts and XORs chunk by chunk. If you want Barrett reduction, you need efficient carryless multiplication such as PCLMULQDQ (https://www.felixcloutier.com/x86/pclmulqdq) to do the inner product and reduce the polynomial. Or, in a more primitive way, find ways to do finite field Horner's method in SIMD...

How to think in SIMD is already hard, as said in #3. How to balance trade-offs like these in SIMD is even harder. Unless you want a certain edge, or want to shatter a benchmark, I would say SIMD is not a good investment; you need to use it in the right scenario at the right time. SIMD is useful, but also kind of niche, and modern CPUs are optimized well enough that the performance of general solutions without SIMD is usually good enough too, since everything eventually gets decoded down to uops anyway, with the deep pipeline, branch predictor, superscalar and speculative execution doing their magic altogether. And most of the time, if you do want SIMD, using the easiest SIMD method is generally enough.

*: I used std::simd intensively in my own project. Well, the submission got refused: the paper was severely lacking in literature study, and I shouldn't have used an LLM so much to generate it. However, the code is here (https://github.com/stevefan1999-personal/sigmah). I now have a new approach to this problem, derived from my current work with finite fields, error correction, divide and conquer, and polynomial multiplication, and I plan to resubmit the paper once I have time to clear it up, with a more careful approach next time too. Although, since the problem of string matching with don't-cares can be seen as convolution, I suspect my approach would end up looking like that... making the paper still unworthy of acceptance.
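PS: since (b) and (e) are concrete enough to show, here are two minimal sketches. These are my illustrations, not drawn from the repository above. First, scalar Russian peasant multiplication in GF(2^8), with the reduction polynomial passed in so it covers both 0x11B and 0x11D:

    /// Russian peasant multiplication in GF(2^8).
    /// `poly` is the reduction polynomial, e.g. 0x11B (AES) or 0x11D.
    fn gf256_mul(mut a: u8, mut b: u8, poly: u16) -> u8 {
        let mut acc = 0u8;
        for _ in 0..8 {
            if b & 1 != 0 {
                acc ^= a; // carryless "add" is XOR
            }
            b >>= 1;
            let overflow = a & 0x80 != 0;
            a <<= 1;
            if overflow {
                a ^= poly as u8; // fold x^8 back in: XOR the low byte of poly
            }
        }
        acc
    }

    fn main() {
        // 0x53 and 0xCA are multiplicative inverses in the AES field.
        assert_eq!(gf256_mul(0x53, 0xCA, 0x11B), 0x01);
    }

Second, the GF2P8MULB path via the raw intrinsic, which Rust exposes in core::arch (behind the gfni target feature, if I remember the gating correctly). It does 16 multiplications at once but is hard-wired to 0x11B, which is exactly why it cannot serve a 0x11D field directly:

    #[cfg(target_arch = "x86_64")]
    use core::arch::x86_64::{__m128i, _mm_gf2p8mul_epi8};

    /// 16 parallel GF(2^8) multiplications, fixed to the polynomial 0x11B.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "gfni")]
    unsafe fn gf256_mul_x16(a: __m128i, b: __m128i) -> __m128i {
        _mm_gf2p8mul_epi8(a, b)
    }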
janwas 9 minutes ago
> performance of general solutions without using SIMD, is good enough too, since all of which will eventually dump right down to the uops anyway, with deep pipeline, branch predictor, superscalar and speculative execution doing their magics altogether

A quick comment on this one point (personal opinion): from a hyperscaler perspective, scalar code is most certainly not enough. The energy cost of scheduling a MUL instruction is something like 10x that of the actual operation it performs. It is important to amortize that cost over many elements (i.e. SIMD).
eden-u4 an hour ago
wow, thanks for this long explanation.