menaerus | 4 days ago
The issue was that they were hand-optimizing the SIMD path for a workload that doesn't even exist in their case, so once they profiled the code with the new optimization they saw one big nothing - a classic trap with 95% of "optimizations". It turned out that their data is concentrated in the 1- and 2-byte LEBs, and there the SIMD approach doesn't give the same gains it does on 9- and 10-byte LEBs (a rough sketch of why is below).

I'm wondering more about the bit-for-bit compatibility part of "This was verified in a benchmark that ran both versions on billions of random numbers, confirming both the huge speedup and the bit-for-bit compatibility of the result." How does one test all 2^64 inputs, roughly 18.4 quintillion numbers? That isn't possible; billions of random samples cover only a vanishing fraction of the input space, so this is statistical evidence, not verification.
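
For context, a minimal sketch (my own, not from the article) of why uniformly random u64s skew such a benchmark: LEB128 stores 7 payload bits per byte, so only the values below 2^56 fit in 8 bytes or fewer, which is 1/256 of all u64s under a uniform draw. The helper name leb128_len is hypothetical.

    // Sketch: LEB128-encoded length of a u64.
    // LEB128 stores 7 payload bits per byte; the MSB is the continuation bit.
    fn leb128_len(mut v: u64) -> usize {
        let mut len = 1;
        while v >= 0x80 {
            v >>= 7;
            len += 1;
        }
        len
    }

    fn main() {
        assert_eq!(leb128_len(127), 1);       // < 2^7: fits in 1 byte
        assert_eq!(leb128_len(128), 2);       // >= 2^7: needs 2 bytes
        assert_eq!(leb128_len(u64::MAX), 10); // all 64 bits: 10 bytes
        // Values encodable in <= 8 bytes are exactly those below 2^56,
        // i.e. 2^56 / 2^64 = 1/256 of all u64s under a uniform draw,
        // so ~99.6% of random inputs land in the 9-10 byte buckets.
        println!("P(len <= 8) for uniform u64: {}", 1.0_f64 / 256.0);
    }

Which is exactly why a random-number benchmark shows a huge speedup while a real workload dominated by 1-2 byte LEBs sees nothing.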