| ▲ | Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets(arxiv.org) | |||||||||||||
| 65 points by mpweiher a day ago | 7 comments | ||||||||||||||
| ▲ | ridiculous_fish 2 hours ago | parent | next [-] | |||||||||||||
This paper missed the state of the art!!! This concerns unsigned division by a 32 bit constant divisor. Compilers routinely optimize this to a high-multiply by a "magic number" but this number may be up to 33 bits (worst-case, divisor dependent, 7 is a problem child). So you may need a 32x33-bit high multiply. What compilers do today is some bit tricks to perform a 32x33 bit high multiply through a 32x32 multiply and then handling the last bit specially, through additions and bitshifts. That's the "GM Method" in the paper; the juice to be squeezed out is the extra stuff to handle the 33rd bit. What the paper observes is that 32x33 high multiply can be performed via a 64x64 high multiply and then the extra stuff goes away. Well yes, of course. But amazingly in ALL of these worst case situations you can compute a different magic number, and perform a (saturating) increment of the dividend at runtime, and ONLY need a 32 bit high multiply. That is, these are the two algorithms for unsigned division by constants where the magic number overflows: - This paper: Promote to twice the bit width, followed by high multiply (32x64 => 64) and then bitshift - SOTA: Saturating increment of the dividend, then high multiply (32x32 => 32) and then bitshift Probably the paper's approach is a little better on wide multipliers, but they missed this well-known technique published 15 years ago (and before that, I merely rediscovered it): https://ridiculousfish.com/blog/posts/labor-of-division-epis... | ||||||||||||||
| ||||||||||||||
| ▲ | foltik 4 hours ago | parent | prev | next [-] | |||||||||||||
Compilers already optimize division of a 32 bit integer x by a constant d like:
And /2^a is equivalent to >>a, so we have:
Which is just a multiply and shift, cheaper than integer division.Problem is, with some divisors like 7, 14, 21, ... the smallest c that works to produce an exact result is 33 bits. Awkward on 32 bit CPUs, it just barely doesn’t fit but you can account for that extra 1 bit manually with an extra sequence of sub, shift, add. What the paper points out is twofold: 1. On 64 bit CPUs the extra sequence isn’t required, you can just do a full 64x64 bit multiply and 128 bit shift. Back to just a multiply and shift like the original optimization, and already faster than the workaround. 2. Even better, the shift itself can be optimized away. A 64x64 bit multiply stores the 128 bit product in a pair of 64 bit registers, meaning you basically get >> 64 for free by just looking at the upper half register in the pair. That means if you pre-shift the constant to:
Now (x * c’) >> 64 is equivalent to (x * c) >> a. And the result is just waiting for you in a 64 bit register, no shift needed.Just one multiply to divide a 32 bit integer by 7. Nice. | ||||||||||||||
| ▲ | purplesyringa 2 hours ago | parent | prev | next [-] | |||||||||||||
I must admit I'm surprised to see this -- Lemire offhandedly mentioned in the famous remainder blog post (https://lemire.me/blog/2019/02/08/faster-remainders-when-the...) that 64-bit constants can be used for 32-bit division, and even provided a short example to compute the remainder that way (though not the quotient). Looking a bit more, it seems like libdivide didn't integrate this optimization either. I guess everyone just assumed that this is so well-known now, that compilers have certainly integrated it, but no one actually bothered to submit a patch until now, when it was reinvented? | ||||||||||||||
| ▲ | gopalv 3 hours ago | parent | prev | next [-] | |||||||||||||
The linked clang PR is also very readable. https://github.com/llvm/llvm-project/pull/181288/files As the PR clearly points out, you can do this in a register but not inside vectors. I don't think fastdiv has had an update in years, which what I've used because compilers can't do "this is a constant for the next loop of 1024" like columnar sql needs. | ||||||||||||||
| ▲ | pkaye 2 hours ago | parent | prev [-] | |||||||||||||
Look up the book Hackers Delight if you like there low level algorithms and bit manipulations | ||||||||||||||