sparkie 4 hours ago
SIMD is for performing parallel operations on many smaller values. It can help with some cryptography, but it doesn't necessarily help when performing single arithmetic operations on larger types, though it does help with logic and shift operations on them. If we were performing 128-bit arithmetic in parallel over many values, a SIMD implementation might help, but without a SIMD equivalent of `addcarry`, there's a limit to how much it can help. Something like this could potentially be added to AVX-512, for example, by utilizing the `k` mask registers for the carries.

The best we have currently is `adcx` and `adox`, which let us use two interleaved add-with-carry chains, where one utilizes the carry flag and the other utilizes the overflow flag, which improves ILP. These instructions are quite niche but are used in bigint libraries to improve performance.
wahern an hour ago
> but it doesn't necessarily help when performing single arithmetic operations on larger types.

For the curious, AFAIU the problem is the dependency chains. For example, for simple bignum addition you can't just naively perform all the adds on each limb in parallel and then apply the carries in parallel; the addition of each limb depends on the carries from the previous limbs. Working around this with masking and other tricks typically ends up adding too many additional operations, resulting in lower throughput than non-SIMD approaches.

There are quite a few papers on using SIMD to accelerate bignum arithmetic for single operations, but they all seem quite complicated and heavily qualified. The threshold for eking out any gain is quite high, e.g. a minimum of 512-bit numbers or much larger, depending. And they tend to target complex or specialized operations (not straight addition, multiplication, etc.) where clever algebraic rearrangement can profitably reorder dependency chains for SIMD specifically.