ismailmaj 3 days ago
The way I have explained it to myself is that so many CUDA algorithms don't bother much with numerical stability because, in deep learning, the error acts as a form of regularization (i.e. less overfitting to the data).
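To make that idea concrete, here is a minimal sketch of deliberately injecting noise into gradients as an explicit regularizer, which is the rough analogue of the rounding/reordering error in fast kernels. The function name, lr, and noise_std are my own placeholders, not anything from a specific library or kernel:

    import torch

    def noisy_sgd_step(model, loss, lr=1e-2, noise_std=1e-3):
        # Backprop exact gradients, then perturb them before the update,
        # loosely mimicking low-precision or reordered accumulation.
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p.grad += noise_std * torch.randn_like(p.grad)
                    p -= lr * p.grad
        model.zero_grad()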
Nevermark 3 days ago | parent
I am not quite sure what that means! :) But there are good reasons why deep learning training is very robust to moderate inaccuracy in gradients:

1. Locally, sigmoid and similar functions are about the simplest, smoothest possible non-linearity to propagate gradients through.

2. Globally, outside of deep recurrent networks, there is no recursion, so the total function stays smooth and well behaved.

3. While the perfect gradient indicates the ideal direction to adjust parameters for fastest improvement, all that is really needed to reduce error is to move parameters in the direction of the gradient signs, with a small enough step (see the sign-only sketch below). That is a very low bar. It's like telling an archer they just need to shoot an arrow so it lands closer to the target than where the archer is standing, without worrying about hitting it!

4. Finally, the perfect first-order gradient is only meaningful at one point of the optimization surface. Move away from that point, i.e. update the parameters at all, and the gradient changes quickly. So we are in gradient heuristic land even with "perfect" first-order gradients. Even the most perfectly calculated gradient isn't really "accurate" to begin with. To get a gradient that stays accurate over a whole parameter step would take fitting the local surface with a second- or third-order polynomial, i.e. not just first but second and third order derivatives, at vastly greater computational and working-memory cost.

--

The only critical issue for calculating gradients is that there is enough precision for at least the directional gradient information to make it from the errors back to the parameters being updated. If precision is too low, the variable-magnitude rounding inherent to floating point arithmetic can completely drop the directional information for smaller gradients (see the float16 sketch below). Without accurate gradient signs, learning stalls.
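To illustrate point 3, here is a small NumPy toy of my own (not from any real training loop): updating parameters using only the sign of the gradient, with a small step, still drives a simple least-squares loss toward its minimum.

    import numpy as np

    # Toy least-squares problem: minimize f(w) = 0.5 * ||A w - b||^2
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    w_true = rng.normal(size=10)
    b = A @ w_true                      # consistent system, so the minimum is 0

    def loss(w):
        r = A @ w - b
        return 0.5 * r @ r

    w = np.zeros(10)
    step = 1e-2
    for _ in range(2000):
        grad = A.T @ (A @ w - b)        # exact gradient of the toy loss
        w -= step * np.sign(grad)       # keep only the direction, discard the magnitude

    print(loss(np.zeros(10)), loss(w))  # large initial loss vs. near-zero final loss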
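And for the last point, a tiny float16 demonstration of how low-precision rounding can silently erase the directional information in small gradients; this is the effect that loss scaling in mixed-precision training is meant to counter. The specific numbers are just illustrative:

    import numpy as np

    # A small gradient contribution accumulated into a large float16 value
    # falls below the rounding step at that magnitude and simply vanishes.
    print(np.float16(1024.0) + np.float16(0.05) == np.float16(1024.0))      # True
    print(np.float32(1024.0) + np.float32(0.05) == np.float32(1024.0))      # False

    # A gradient smaller than float16's smallest subnormal rounds to zero:
    # the sign is gone, so that parameter gets no update at all.
    print(np.float16(1e-8))                                                 # 0.0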