▲ | akoboldfrying 3 days ago | ||||||||||||||||
Your "Branchless" approach could indeed be implemented very efficiently in a CPU with AVX2 (256-bit-wide vectors). With the current element in rax, the 4 valid values in ymm2, and the 4 running totals in ymm3 (initially zero), the inner loop would be just:
VPBROADCASTQ copies rax into each of the 4 lanes in ymm1. The VPCMPEQQ sets each qword there to all-0 ( = 0) or all-1 ( = -1) depending on the comparison result, so the VPADDQ will accumulate 4 running negative totals into ymm3, which can be negated afterwards.I would still expect the perfect hash function approach to be faster, though -- a similar number of operations, but 25% of the memory movement. | |||||||||||||||||
▲ | Sesse__ 3 days ago | parent [-] | ||||||||||||||||
“Memory movement”? None of the instructions you list involve memory. I find the perfect hash implementation a bit weird; it seems to obfuscate that you simply look at the lowest two bits (since they differ between the four values). You can do the x + 3 and 3 - expr at the very end, once, instead of for every element. | |||||||||||||||||
|