charleslmunger 5 days ago
I tried implementing an optimized varint encoder that did something similar, by populating an 8-byte value and then storing it to RAM, but unaligned overlapping stores caused big regressions. The approach that worked required a different technique. This one is for encoding backwards:

1. One branch for the one-byte case, always inlined: just store the byte.
2. Calculate the size of the unencoded zero prefix with a branch-free construction: (((uint32_t)clz + 7) * 9) >> 6
3. Hand-roll a jump table, taking advantage of Arm's fixed instruction width to calculate the jump target with a shift.

https://github.com/protocolbuffers/protobuf/commit/b039dfe26...

This results in one branch for 1-byte varints, plus one additional branch for any larger size, and the branch predictor does not have to track a varying trip count through a loop. This approach resulted in a 2.64% speedup for overall encoding (which includes a lot more than just varints) on mid-sized Arm cores.

I think it's very difficult to beat a single comparison and branch for the 1-byte case for actual encoding, forwards or backwards, unless you know there's going to be long runs of one-byte values.
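To make the size calculation in step 2 concrete, here is a minimal C sketch. It is not the code from the linked protobuf commit. The helper names are made up for illustration, `__builtin_clzll` assumes GCC or Clang, and the `v | 1` guard is my own addition because that builtin is undefined for zero. The backward encoder uses a plain counted loop purely to show the byte layout; the approach described above replaces that loop with a hand-rolled jump table so there is no varying trip count.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes, out of the 10-byte maximum for a 64-bit varint,
 * that the value does NOT need. Uses the branch-free formula quoted
 * in the comment: ((clz + 7) * 9) >> 6. */
static inline uint32_t varint_skipped_bytes(uint64_t v) {
    /* Assumption: GCC/Clang builtin. It is undefined for 0, so set the
     * low bit first; that never changes the encoded size of a value. */
    uint32_t clz = (uint32_t)__builtin_clzll(v | 1);
    return ((clz + 7) * 9) >> 6;
}

/* Encoded size in bytes (1..10). */
static inline uint32_t varint_size(uint64_t v) {
    return 10 - varint_skipped_bytes(v);
}

/* Reference implementation: count 7-bit groups with a loop. */
static uint32_t varint_size_loop(uint64_t v) {
    uint32_t n = 1;
    while (v >= 0x80) {
        v >>= 7;
        n++;
    }
    return n;
}

/* Write v so that its encoding ends just before `end`, and return a
 * pointer to the first encoded byte. The loop is a portable stand-in
 * for the jump table, not the optimized technique itself. */
static uint8_t *varint_encode_backward(uint8_t *end, uint64_t v) {
    if (v < 0x80) {               /* step 1: single branch, one byte */
        *--end = (uint8_t)v;
        return end;
    }
    uint32_t size = varint_size(v);
    assert(size >= 2 && size <= 10);
    uint8_t *start = end - size;
    for (uint32_t i = 0; i + 1 < size; i++) {
        start[i] = (uint8_t)(v | 0x80);  /* low 7 bits + continuation */
        v >>= 7;
    }
    start[size - 1] = (uint8_t)v;        /* final byte, no continuation */
    return start;
}
```

For example, encoding 300 into the tail of a 16-byte buffer with `varint_encode_backward(buf + 16, 300)` returns `buf + 14` and writes the bytes 0xAC 0x02 there.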