AlotOfReading a day ago

Most people don't really care about numerical stability or correctness. What they usually want is reproducibility, but they go down a rabbit hole with those other topics as a way to get it, at least in part because everyone thinks reproducibility is too slow.

It was too slow 20 years ago, but that's not the case today. The vast majority of hardware today implements IEEE 754 reproducibly if you're willing to stick to a few basic principles:

1. same inputs

2. same operations, in the same order

3. no "special functions", denormals, or NaNs.

If you accept these restrictions (and realistically you weren't handling NaNs or denormals properly anyway), you can get practical reproducibility on modern hardware for minimal (or no) performance cost if your toolchain cooperates. Sadly, toolchains don't prioritize this because it's easy to get wrong across the scope of a modern language and users don't know that it's possible.
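
To make point 2 concrete, here's a minimal sketch of my own (not from the thread) showing why the order of operations matters: floating-point addition isn't associative, so the same inputs summed in a different order can give a different result.

    // FP addition is not associative, so reassociation (by a compiler or a
    // parallel reduction) breaks bit-for-bit reproducibility even with
    // identical inputs.
    #include <cstdio>

    int main() {
        double a = 1e16, b = -1e16, c = 1.0;
        double left  = (a + b) + c;   // 0.0 + 1.0 == 1.0
        double right = a + (b + c);   // c is absorbed: -1e16 + 1.0 rounds back to -1e16, so the sum is 0.0
        std::printf("%g vs %g\n", left, right);   // prints "1 vs 0"
    }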

craigacp 19 hours ago | parent | next [-]

The same operations in the same order is a tough constraint in an environment where core count is increasing and clock speeds/IPC are not. It's hard to rewrite some of these algorithms to use a parallel decomposition that's the same as the serial one.

I've done a lot of work on reproducibility in machine learning systems, and it's really, really hard. Even the JVM got me by changing some functions in `java.lang.Math` between versions & platforms (while keeping to their documented 2 ulp error bounds).

Dylan16807 12 hours ago | parent | next [-]

Most people aren't spreading single calculations across multiple cores, and the ones that do are already deep enough into the technical weeds to handle deterministic chunking and combining.

"parallel decomposition that's the same as the serial one" would be difficult in many ways, but only needed when you can't afford a one-time change.

AlotOfReading 16 hours ago | parent | prev | next [-]

Many systems I've worked on can be written as embarrassingly parallel with synchronized checkpoints for a relatively small cost if they're not serial, but yeah, the middle ground is hard. That's the nature of the beast though. Imagine trying to rearrange the instructions in your program the same way. You're going to get all sorts of wacky behavior unless you invest an enormous amount of effort into avoiding problems the way superscalar CPUs do.

Relatively few programs truly need to operate that way though and they're usually written by people who know the cost of their choices.

saagarjha 13 hours ago | parent | prev [-]

You can definitely do operations in a reproducible way (assuming you do the reductions in a defined order) but, yeah, you might lose some performance. Unfortunately the ML people seem to pick better performance over correctness basically every time :(

aragilar 9 hours ago | parent [-]

But is that correctness or reproducibility? The latter isn't as important as the former.

hansvm 8 hours ago | parent | prev | next [-]

4. No math libraries, even if their results aren't used in the deterministic path you care about (e.g., floating-point rounding mode bugs)

5. All floating-point optimizations must be hand-rolled. That's implied by (2) and "toolchain cooperates", but it's worth calling out explicitly. Consider, e.g., summing a few thousand floats. On many modern computers, a near-optimal solution is having four vector accumulators, stepping through the data in chunks four vectors wide, adding to each accumulator, adding the accumulators at the end, and then horizontally adding the result (handling any non-length-aligned straggling elements is left as an exercise for the reader, and optimal behavior for those varies wildly). However, this has different results depending on whether you use SSE, AVX, or AVX512 if you want to use the hardware to its full potential. You need to make a choice (usually a bias toward wider vector types is better than narrower types, especially on AMD chips, but this is problem-specific), and whichever choice you make, you can't let the compiler reorder it.
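
A rough sketch of that four-accumulator pattern with SSE intrinsics (the function name and the serial tail for stragglers are my own choices, not anything from the thread); with AVX or AVX-512 widths the partial sums group differently, hence the different results.

    // Four SSE accumulators, combined in a fixed order, then a horizontal add;
    // a serial tail handles stragglers, which is only one reasonable choice.
    #include <immintrin.h>
    #include <cstddef>

    float sum_sse_4acc(const float* data, std::size_t n) {
        __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
        __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {          // 4 vectors x 4 lanes per step
            acc0 = _mm_add_ps(acc0, _mm_loadu_ps(data + i));
            acc1 = _mm_add_ps(acc1, _mm_loadu_ps(data + i + 4));
            acc2 = _mm_add_ps(acc2, _mm_loadu_ps(data + i + 8));
            acc3 = _mm_add_ps(acc3, _mm_loadu_ps(data + i + 12));
        }
        // Combine accumulators, then add horizontally across the lanes.
        __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
        alignas(16) float lane[4];
        _mm_store_ps(lane, acc);
        float total = (lane[0] + lane[1]) + (lane[2] + lane[3]);
        for (; i < n; ++i) total += data[i];    // straggler elements, serial
        return total;
    }

Whichever width you pick, this explicit grouping is exactly the reassociation the compiler must not be allowed to redo at a different width.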

AlotOfReading 8 hours ago | parent [-]

Neither 4 nor 5 is necessarily required in practice. I converted an employer's monorepo to be reproducible semi-recently without eliminating any math libraries, just by forcing the compiler to make consistent implementation choices across platforms.

Indeed, you can't do that kind of hsum. Still, I haven't run into that behavior in practice (and given that I maintain an FP reproducibility test suite, I've tried). I'd be open to any way you can suggest to replicate it with the usual GCC/clang etc.

dzaima 19 hours ago | parent | prev | next [-]

Even denormals and NaNs should be perfectly consistent, at least on CPUs (as long as you're not inspecting the bit patterns of the NaNs).

Irrational stdlib functions (trig/log/exp; not sqrt though for whatever reason) should really be basically the only source of non-reproducibility in typical programs (assuming they don't explicitly do different things depending on non-controlled properties; and don't use libraries doing that either, which is also a non-trivial ask; and that there's no overly-aggressive optimizer that does incorrect transformations).

I'd hope that languages/libraries providing seeded random sources with a guarantee of equal behavior across systems would explicitly note which operations aren't reproducible though, otherwise seeding is rather pointless; no clue if R does that.

AlotOfReading 16 hours ago | parent [-]

Denormals are consistent, but there's hardware variability in whether or not they're handled (even between different FPUs on the same processor in some cases). Depending on codegen and compiler settings you might get different results.

NaN handling is ambiguous in two different ways. First, binary operations taking two NaNs are specified to return one of them in IEEE-754. Which one (if they're different) isn't defined and different hardware makes different choices. Second, I'm not aware of any language where NaN propagation has simple and defined semantics in practice. LLVM essentially says "you'll get one of 4 randomly chosen behaviors in each function", for example.

dzaima 10 hours ago | parent [-]

The only case I recall of denormals specifically being improperly handled is 32-bit ARM NEON; and indeed clang and gcc just don't use it without -ffast-math.

NaNs indeed can get different bit patterns, but my point was that, given that you need to explicitly bit-reinterpret the NaN for that to affect anything, it can't really affect most code. I guess perhaps blindly hashing a NaN's bits? (That of course goes under your "you weren't handling NaNs […] properly anyway".)
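
As a concrete sketch of that (my own illustration): bit-reinterpreting, e.g. to hash a float, is about the only way the payload ever becomes visible.

    // Which NaN payload survives `a + b` when both operands are NaN is
    // hardware-dependent, so bit-reinterpreting the result (say, to hash a
    // float) is not reproducible across platforms.
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    std::uint32_t float_bits(float f) {
        std::uint32_t u;
        std::memcpy(&u, &f, sizeof u);   // the bit pattern, payload included
        return u;
    }

    int main() {
        volatile float a = std::nanf("1");   // quiet NaN, one payload
        volatile float b = std::nanf("2");   // quiet NaN, another payload
        // IEEE 754 only guarantees the result is *a* NaN; whether a's payload,
        // b's, or a canonical one comes out differs between CPUs.
        std::printf("%08x\n", (unsigned)float_bits(a + b));
    }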

AlotOfReading 8 hours ago | parent [-]

The Intel compiler used to (and possibly still does) set the SSE FTZ bit by default, leading to a situation where two different FPUs on the same chip (x87 and SSE) had different FP configs. I was indeed thinking of ARM though. I've come across chips with both VFP and NEON where they're handled differently.
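
A small sketch of that FTZ effect on an x86-64 build where float math goes through SSE (values chosen purely for illustration): flipping the flush-to-zero bit, as those compiler defaults do, changes what an ordinary multiply returns.

    // With FTZ off, 2e-38f * 1e-3f produces a denormal (~2e-41); with the SSE
    // flush-to-zero bit set, the same multiply returns exactly 0.
    #include <xmmintrin.h>
    #include <cstdio>

    int main() {
        volatile float tiny = 2e-38f;    // just above the smallest normal float
        volatile float scale = 1e-3f;
        std::printf("FTZ off: %g\n", tiny * scale);

        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);   // what those defaults enable
        std::printf("FTZ on:  %g\n", tiny * scale);
    }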

For NaNs, if we're being strict then you don't need to reach that far. Here's one example from LLVM:

https://github.com/llvm/llvm-project/issues/43070

This can occasionally result in visible bugs, like: https://github.com/rust-lang/rust/issues/110174

You have to be pretty nitpicky to care about this stuff and it's not an issue in practice, but it's a reproducibility violation because it deviates from the standard nonetheless.

dzaima 6 hours ago | parent [-]

Intel compilers default to -ffast-math wholesale, so they're just entirely broken: https://godbolt.org/z/18xoEf5Gf

Those LLVM issues are unrelated afaict - the former is high-level IR optimizations that just result in non-reproducible NaN bit patterns (and the things that wouldn't be just NaN bit patterns I can't even repro on the reported llvm 8.0 or any other version I tested)

The latter though is indeed a proper compiler bug, min/max are an awful mess... didn't know that the ARM fminnm/fmaxnm were that funky though, caring about sNaN vs qNaN... makes them essentially unusable for compilers (not even useful in known-not-NaN contexts, as then fmin/fmax are equivalent)

aragilar 9 hours ago | parent | prev | next [-]

I mean, if you get different results for the same input on the same machine with the same code, I'd be concerned. But that's not that interesting really, as you actually care that you get the right answer, not the same answer (otherwise you get https://en.wikipedia.org/wiki/Oil_drop_experiment). Stability and correctness will get you closer to a right answer than demanding you get the same answer as someone else who's solving the wrong problem the wrong way.

AlotOfReading 7 hours ago | parent [-]

I don't think it's usually meaningful to discuss the idea of a "correct answer" without doing the traditional numerical analysis stuff. Realistically, we don't need to formalize most programs to that degree and provide precise error bounds either.

To give a practical example, I work on autonomous vehicles. My employer runs a lot of simulations against the vehicle code to make sure it meets reasonable quality standards. The hardware the code will be running on in the field isn't necessarily available in large quantities in the datacenter though, and realistically I'm not going to be able to apply numerical analysis to thousands of functions and millions of test cases to determine what the "correct" answer is in all cases. Instead, I can privilege whatever the current result is as "correct enough" and ensure the vehicle hardware always gets the same results as the datacenter hardware, then let the many individual developers merging changes review their own simulation results knowing they'll hold on-road.

This is a pretty universal need for most software projects. If I'm a frontend dev, I want to know that my website will render the same on my computer as on the client's. If I'm a game engine dev, I want to ensure that replays don't diverge from the in-game experience. If I'm a VFX programmer, I want to ensure that the results generated on the artist's workstation look the same as what comes out of the render pipeline at the end. Etc.

All of these applications can still benefit from things like stability, but the benefits are orthogonal to reproducibility.

thaumasiotes 18 hours ago | parent | prev [-]

I don't actually understand why you'd want reproducibility in a statistical simulation. If you fix the output, what are you learning? The point of the simulation is to produce different random numbers so you can see what the outcomes are like... right?

Let's say I write a paper that says "in this statistical model, with random seed 1495268404, I get Important Result Y", and you criticize me on the grounds that when you run the model with random seed 282086400, Important Result Y does not hold. Doesn't this entire argument fail to be conceptually coherent?

hansvm 8 hours ago | parent | next [-]

- It eliminates one way of forging results. Without stating the seed, people could consistently fail to reproduce your results and you could always have "oops, guess I had a bad seed" as an excuse. You still have to worry about people re-running the simulation till the results look good, but that's handled via other mechanisms.

- Many algorithms are vastly easier to implement stochastically than deterministically. If you want to replay the system (e.g., to locally debug some production issue), you need those "stochastic" behaviors to nonetheless be deterministic.

- If you're a little careful with how you implement deterministic randomness, you can start to ask counterfactual questions -- how would the system have behaved had I made this change -- and actually compare apples to apples when examining an experimental run.

Even in your counterexample, the random seeds being reproducible and published is still important. With the seed and source published, now anyone can cheaply verify the issue with the simulation, you can debug it, you can investigate the proportion of "bad" seeds and suss out the error bounds of the simulation, etc.
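
For the replay and counterfactual points, one hedged sketch (the generator choice and uniform mapping are my own, not anything the commenter specified): use a fully specified generator and avoid the standard distributions, whose output is implementation-defined.

    // std::mt19937_64's output sequence is defined by the C++ standard, so a
    // stored seed replays the exact same "random" decisions anywhere; the
    // uniform mapping is hand-rolled because <random> distributions are
    // implementation-defined and can differ between library versions.
    #include <cstdint>
    #include <cstdio>
    #include <random>

    double uniform01(std::mt19937_64& rng) {
        return (rng() >> 11) * 0x1.0p-53;   // 53 bits -> [0, 1), same everywhere
    }

    int main() {
        std::uint64_t seed = 12345;          // logged alongside the run
        std::mt19937_64 rng(seed);
        for (int i = 0; i < 3; ++i)
            std::printf("%.17g\n", uniform01(rng));   // identical on every replay
    }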

thaumasiotes 2 hours ago | parent [-]

> It eliminates one way of forging results. Without stating the seed, people could consistently fail to reproduce your results and you could always have "oops, guess I had a bad seed" as an excuse.

Why does this matter? If people consistently fail to get results similar to yours by using different seeds, your results are invalid whether or not you made them up. And if people consistently do get results similar to yours with different seeds, your results are valid whether or not you made them up. No?

> Even in your counterexample, the random seeds being reproducible and published is still important.

> you can investigate the proportion of "bad" seeds and suss out the error bounds of the simulation

How does publishing seeds help with that? You do that by examining different seeds. If you know what seeds were used in the paper, this process doesn't change in any way.

AlotOfReading 16 hours ago | parent | prev [-]

The point is that if you run the simulation with seed X, you'll get the same results I do. This means that all you need to reproduce my results is the code and any inputs like the seed, rather than the entire execution history. If you want to provide different inputs and get the same result, that's another matter entirely (where numerical stability will be much more important).

thaumasiotes 13 hours ago | parent [-]

But what is the value of reproducing your results that way? Those results only mean anything if they're similar to what happens with other seeds.

dzaima 10 hours ago | parent | next [-]

If everything goes roughly the same with all seeds, then everything's indeed just fine. If not, I have a paper trail showing that I did in fact encounter the unexpected thing instead of just making up results. (Indeed a seed could be cherry-picked, but the more that is computed from the seed, the more infeasible that gets.)

And that hints at the possibility of using seeds specifically for noting rare occurrences: communicating weird/unexpected(/buggy?) examples to collaborators, or running for X hours to gather statistics without having to store all intermediate states of every single attempt, in case you later want to do extra analysis for cases insufficiently analyzed by the existing code, etc.

saagarjha 13 hours ago | parent | prev [-]

If something fails, I can reproduce the failure with your seed.