ismailmaj 3 days ago
The way I have explained it to myself is that so many CUDA algorithms don't bother much with numerical stability because, in deep learning, the error acts as a form of regularization (i.e. less overfitting to the data).
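To make that idea concrete, here is a minimal sketch of deliberately injecting noise into gradients as an explicit regularizer, which is the rough analogue of the rounding/reordering error in fast kernels. The function name, lr, and noise_std are my own placeholders, not anything from a specific library or kernel:

    import torch

    def noisy_sgd_step(model, loss, lr=1e-2, noise_std=1e-3):
        # Backprop exact gradients, then perturb them before the update,
        # loosely mimicking low-precision or reordered accumulation.
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p.grad += noise_std * torch.randn_like(p.grad)
                    p -= lr * p.grad
        model.zero_grad()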
Nevermark 3 days ago | parent
I am not quite sure what that means! :) But there are good reasons why deep learning training is very robust to moderate inaccuracy in gradients:

1. Locally, sigmoid and similar functions are about the simplest, smoothest possible non-linearity to propagate gradients through.

2. Globally, outside of deep recurrent networks, there is no recursion, so the total function stays smooth and well behaved.

3. While the perfect gradient indicates the ideal direction to adjust parameters for fastest improvement, all that is really needed to reduce error is to move parameters in the direction of the gradient signs, with a small enough step (see the sign-only sketch below). That is a very low bar. It's like telling an archer they just need to shoot an arrow so it lands closer to the target than where the archer is standing, without worrying about hitting it!

4. Finally, the perfect first-order gradient is only meaningful at one point of the optimization surface. Move away from that point, i.e. update the parameters at all, and the gradient changes quickly. So we are in gradient heuristic land even with "perfect" first-order gradients. Even the most perfectly calculated gradient isn't really "accurate" to begin with. To get a gradient that stays accurate over a whole parameter step would take fitting the local surface with a second- or third-order polynomial, i.e. not just first but second and third order derivatives, at vastly greater computational and working-memory cost.

--

The only critical issue for calculating gradients is that there is enough precision for at least the directional gradient information to make it from the errors back to the parameters being updated. If precision is too low, the variable-magnitude rounding inherent to floating point arithmetic can completely drop the directional information for smaller gradients (see the float16 sketch below). Without accurate gradient signs, learning stalls.
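To illustrate point 3, here is a small NumPy toy of my own (not from any real training loop): updating parameters using only the sign of the gradient, with a small step, still drives a simple least-squares loss toward its minimum.

    import numpy as np

    # Toy least-squares problem: minimize f(w) = 0.5 * ||A w - b||^2
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    w_true = rng.normal(size=10)
    b = A @ w_true                      # consistent system, so the minimum is 0

    def loss(w):
        r = A @ w - b
        return 0.5 * r @ r

    w = np.zeros(10)
    step = 1e-2
    for _ in range(2000):
        grad = A.T @ (A @ w - b)        # exact gradient of the toy loss
        w -= step * np.sign(grad)       # keep only the direction, discard the magnitude

    print(loss(np.zeros(10)), loss(w))  # large initial loss vs. near-zero final loss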
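And for the last point, a tiny float16 demonstration of how low-precision rounding can silently erase the directional information in small gradients; this is the effect that loss scaling in mixed-precision training is meant to counter. The specific numbers are just illustrative:

    import numpy as np

    # A small gradient contribution accumulated into a large float16 value
    # falls below the rounding step at that magnitude and simply vanishes.
    print(np.float16(1024.0) + np.float16(0.05) == np.float16(1024.0))      # True
    print(np.float32(1024.0) + np.float32(0.05) == np.float32(1024.0))      # False

    # A gradient smaller than float16's smallest subnormal rounds to zero:
    # the sign is gone, so that parameter gets no update at all.
    print(np.float16(1e-8))                                                 # 0.0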