godelski 2 days ago

Normalization doesn't fix the vanishing gradient problem. As you point out, the problem doesn't go away with infinite precision, because numerical precision isn't the cause. (I've got a Ph.D. in this, btw.)

D-Machine 2 days ago

Yup, normalization doesn't fix vanishing gradients. It mostly just helps with conditioning, probably in part because it keeps certain Lipschitz constants from getting too large (as you likely know, given the PhD), and typical residual connections help with conditioning in a similar way. But yeah, in that sense, normalization is more about exploding gradients, and residual connections are more about vanishing ones. A bit sloppy on my part.
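
For anyone who wants to poke at the residual half of this, here's a toy NumPy sketch. Everything in it is an assumption picked for illustration: the tanh layers, the 0.5/sqrt(width) weight scale, the depth and width, and the helper name `input_grad_norm` (hypothetical, not from any library). It backpropagates by hand through a deep stack and compares the gradient norm at the input with and without skip connections. It says nothing about normalization, only about the "residual connections are more about vanishing" part.

```python
import numpy as np

def input_grad_norm(depth, width, residual, scale=0.5, seed=0):
    """Norm of d(sum(h_L))/d(h_0) for a stack of tanh layers.

    Toy setup (all assumptions): each layer is h -> tanh(W h), or
    h -> h + tanh(W h) when residual=True, with W drawn i.i.d. from
    N(0, (scale / sqrt(width))^2).
    """
    rng = np.random.default_rng(seed)
    weights = [
        rng.normal(0.0, scale / np.sqrt(width), size=(width, width))
        for _ in range(depth)
    ]

    # Forward pass, keeping tanh'(z) = 1 - tanh(z)^2 for each layer.
    h = rng.normal(size=width)
    derivs = []
    for W in weights:
        a = np.tanh(W @ h)
        derivs.append(1.0 - a ** 2)
        h = h + a if residual else a

    # Backward pass: start from d(sum(h_L))/d(h_L) = vector of ones.
    g = np.ones(width)
    for W, d in zip(reversed(weights), reversed(derivs)):
        back = W.T @ (d * g)                # gradient through tanh(W h)
        g = g + back if residual else back  # the skip path adds an identity term
    return float(np.linalg.norm(g))

plain = input_grad_norm(depth=50, width=64, residual=False)
skip = input_grad_norm(depth=50, width=64, residual=True)
print(f"plain: {plain:.3e}   residual: {skip:.3e}")
```

With small weights, the plain stack multiplies the gradient by a contraction at every layer, so it shrinks geometrically with depth. The residual stack multiplies by (identity + that same contraction), so the identity path keeps the gradient from collapsing. Neither case involves precision running out: the plain gradient is tiny because the product of Jacobians is tiny.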