hansvm | 8 hours ago
4. No math libraries, even if their results aren't used in the deterministic path you care about (e.g., floating-point rounding-mode bugs).

5. All floating-point optimizations must be hand-rolled. That's implied by (2) and by "toolchain cooperates," but it's worth calling out explicitly.

Consider, e.g., summing a few thousand floats. On many modern computers, a near-optimal solution is to keep four vector accumulators, stride through the data in chunks four vectors wide, add one vector into each accumulator, add the accumulators together at the end, and then horizontally sum the result (handling any straggling elements when the length isn't chunk-aligned is left as an exercise for the reader, and the optimal behavior there varies wildly). However, this produces different results depending on whether you use SSE, AVX, or AVX-512, if you want to use the hardware to its full potential. You have to make a choice (a bias toward wider vector types is usually better than narrower ones, especially on AMD chips, but this is problem-specific), and whichever choice you make, you can't let the compiler reorder it. A sketch of that strategy follows below.
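A minimal sketch of the hand-rolled sum, assuming AVX (compile with -mavx and without -ffast-math, which would re-license reordering); the function name and the scalar tail are illustrative choices, not the only ones:

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum n floats with four independent AVX accumulators. Pinning the
     * vector width via intrinsics fixes the rounding order, instead of
     * letting the auto-vectorizer pick SSE/AVX/AVX-512 per target. */
    float sum4_avx(const float *x, size_t n) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps();
        __m256 acc3 = _mm256_setzero_ps();
        size_t i = 0;
        /* Main loop: chunks four vectors (32 floats) wide; independent
         * accumulators hide the add latency. */
        for (; i + 32 <= n; i += 32) {
            acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(x + i));
            acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(x + i + 8));
            acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(x + i + 16));
            acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(x + i + 24));
        }
        /* Combine the accumulators in a fixed order, then reduce the
         * lanes left to right for a deterministic horizontal sum. */
        __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                   _mm256_add_ps(acc2, acc3));
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float total = 0.0f;
        for (int k = 0; k < 8; k++) total += lanes[k];
        /* One choice for the stragglers: a plain scalar tail. The
         * comment above notes the optimal handling varies wildly. */
        for (; i < n; i++) total += x[i];
        return total;
    }

The same code on SSE (4-wide) or AVX-512 (16-wide) would group the additions differently and therefore round differently, which is exactly why the width has to be a deliberate, frozen choice.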
AlotOfReading | 8 hours ago
Neither 4 nor 5 is necessarily required in practice. I converted an employer's monorepo to reproducible builds semi-recently without eliminating any math libraries, just by forcing the compiler to make consistent implementation choices across platforms (one such choice is sketched below). Granted, you can't do that kind of hsum. Still, I haven't run into that behavior in practice (and given that I maintain an FP reproducibility test suite, I've tried). I'd be open to any way you can suggest to replicate it with the usual GCC/Clang/etc.
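For concreteness, here's a hypothetical sketch of one implementation choice you'd pin down: FMA contraction, controlled by the real GCC/Clang flag -ffp-contract. Whether this prints 0 or the rounding error (~4.9e-32) depends on that flag, the optimization level, and whether the target has FMA hardware, so freezing the flag across platforms removes one source of divergence:

    #include <stdio.h>

    /* Under -ffp-contract=fast (GCC's default) or =on, a*b - c may be
     * contracted into fma(a, b, -c), skipping the intermediate rounding
     * of a*b; under -ffp-contract=off it never is. */
    static double mul_sub(double a, double b, double c) {
        return a * b - c; /* candidate for contraction into an FMA */
    }

    int main(void) {
        volatile double v = 1.0 + 0x1p-52; /* volatile defeats constant folding */
        double a = v;
        double p = a * a;                  /* product, rounded to double */
        /* Exact a*a minus the rounded product: 0 without FMA,
         * the rounding error 0x1p-104 with it. */
        printf("%.17g\n", mul_sub(a, a, p));
        return 0;
    }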