godelski 3 days ago

Residual connections provide numerical stability with respect to gradients and network updates. Look up the vanishing gradient problem. We wouldn't be able to build these deep networks without them.

It's easiest to see with ReLU because you'll clearly get dead neurons: once the gradient hits 0, you can't recover.
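
Roughly, in code (a minimal sketch assuming PyTorch; the single weight and input are made-up numbers, just to show the zero gradient):

    import torch

    # One "neuron": ReLU applied to w * x. With w pushed negative, the
    # pre-activation is negative, ReLU outputs 0, and ReLU's gradient there
    # is 0, so nothing ever flows back to w -- the unit can't recover.
    w = torch.tensor([-2.0], requires_grad=True)  # hypothetical weight that has gone "dead"
    x = torch.tensor([1.0])

    out = torch.relu(w * x)
    out.backward()

    print(out.item())  # 0.0
    print(w.grad)      # tensor([0.]) -- exactly zero gradient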

The dead-neuron issue can also be mitigated by modern activation functions that are smooth, but the other problem is that your gradient signal degrades as you move further down the network. Residuals help a stronger signal propagate through the whole network. You can see this if you run the numbers by hand, even for a not-so-deep network, as in the sketch below.
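
Here is one way to "run the numbers" (a sketch assuming PyTorch; the depth, width, and tanh nonlinearity are arbitrary choices for illustration): compare the gradient that reaches the first layer of a plain deep stack with the same stack plus identity skip connections.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    DEPTH, WIDTH = 50, 64  # hypothetical sizes, chosen only for illustration

    def first_layer_grad_norm(use_residual: bool) -> float:
        layers = nn.ModuleList(nn.Linear(WIDTH, WIDTH) for _ in range(DEPTH))
        h = torch.randn(1, WIDTH)
        for layer in layers:
            out = torch.tanh(layer(h))
            h = h + out if use_residual else out  # identity skip vs. plain stack
        h.sum().backward()
        # How much gradient signal made it all the way back to layer 0?
        return layers[0].weight.grad.norm().item()

    print("plain stack:", first_layer_grad_norm(False))
    print("residual:   ", first_layer_grad_norm(True))

On this toy setup the plain stack's first-layer gradient comes out many orders of magnitude smaller than the residual version's, which is the signal degradation described above.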

So just saying "numerically stable" can mean lots of things...

D-Machine 3 days ago

Well, you can build deep networks without residual connections, as long as you keep normalizing and initialize very carefully. I'm sure there are other clever ways to keep things well conditioned, like fancy reparameterizations or spectral norms. It's just that residual connections are dead simple to implement and reason about, and fast to boot, so it isn't a huge surprise that they feature in most of the most successful models.
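
A rough sketch of what "keep normalizing and initialize very carefully" can look like (assuming PyTorch; orthogonal init plus LayerNorm is just one concrete combination, not a claim that it's equivalent to residual connections):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    DEPTH, WIDTH = 50, 64  # hypothetical sizes

    def block(careful: bool) -> nn.Module:
        linear = nn.Linear(WIDTH, WIDTH)
        if careful:
            nn.init.orthogonal_(linear.weight)  # singular values of 1: the map neither shrinks nor stretches
            nn.init.zeros_(linear.bias)
            return nn.Sequential(linear, nn.LayerNorm(WIDTH), nn.Tanh())
        return nn.Sequential(linear, nn.Tanh())  # default init, no normalization

    def first_layer_grad_norm(careful: bool) -> float:
        net = nn.Sequential(*[block(careful) for _ in range(DEPTH)])
        net(torch.randn(1, WIDTH)).sum().backward()
        return net[0][0].weight.grad.norm().item()

    print("default init, no norm: ", first_layer_grad_norm(False))
    print("orthogonal + LayerNorm:", first_layer_grad_norm(True))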

And yes, "numerically stable" is a vagueness here. I had implied above that gradient vanishing wouldn't be such a problem with infinite precision, but, actually, this isn't true, if you think about it a bit. You're still going to get stuck on plateaus even with infinite precision. In this sense, if by "numerically stable" one means "with respect to floating point issues", well, then, GP is correct, because that kind of numerical instability is actually just not really a serious issue for training modern networks (at least, certainly not if training with fp32). Instability in optimization is because gradient noise (from e.g. mini-batch sampling) will swamp the useful gradient signal when gradients get too small, i.e. you start doing a random walk.

godelski 2 days ago

Normalization doesn't fix the vanishing gradient problem. As you point out, it doesn't go away with infinite precision, because precision isn't the problem. (I've got a Ph.D. in this, btw.)

D-Machine 2 days ago

Yup, normalization doesn't fix vanishing gradients. It just sort of generally helps with conditioning (as you likely know, given the PhD, probably in part because it helps keep certain Lipschitz constants from being too large), just like typical residual connections generally help with conditioning. But yeah, in that sense, normalization is more about exploding gradients, and residual connections are more about vanishing. A bit sloppy on my part.
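
For the exploding side, a toy sketch (assuming PyTorch; a purely linear stack with a deliberately oversized init, so the only thing on display is scale):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    DEPTH, WIDTH = 30, 64  # hypothetical sizes

    def first_layer_grad_norm(use_layernorm: bool) -> float:
        blocks = []
        for _ in range(DEPTH):
            linear = nn.Linear(WIDTH, WIDTH)
            nn.init.normal_(linear.weight, std=2.0 / WIDTH ** 0.5)  # deliberately too large
            blocks.append(linear)
            if use_layernorm:
                blocks.append(nn.LayerNorm(WIDTH))  # rescales activations back to unit scale
        net = nn.Sequential(*blocks)
        net(torch.randn(1, WIDTH)).sum().backward()
        return net[0].weight.grad.norm().item()

    print("no normalization:", first_layer_grad_norm(False))
    print("with LayerNorm:  ", first_layer_grad_norm(True))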