D-Machine 3 days ago

Well, you can build deep networks without residual connections, as long as you keep normalizing and initialize very carefully. I'm sure there are other clever ways to keep things well conditioned, like fancy reparameterizations or spectral norms. It's just that residual connections are dead simple to implement and reason about, and fast to boot, so it isn't a huge surprise that they feature in most of the most successful models.
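
To make the "dead simple" point concrete, here's a rough PyTorch sketch (the block names and layer choices are just illustrative, not anyone's canonical architecture): the residual version differs from the plain one by a single addition in the forward pass.

    import torch.nn as nn

    class PlainBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return self.body(x)

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            # the whole trick: keep an identity path and learn a correction on top of it
            return x + self.body(x)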

And yes, "numerically stable" is vague here. I implied above that gradient vanishing wouldn't be such a problem with infinite precision, but if you think about it a bit, that isn't actually true: you're still going to get stuck on plateaus even with infinite precision. In that sense, if by "numerically stable" one means "with respect to floating point issues", then GP is correct, because that kind of numerical instability just isn't a serious issue for training modern networks (certainly not when training in fp32). Instability in optimization arises because gradient noise (from e.g. mini-batch sampling) swamps the useful gradient signal when gradients get too small, i.e. you end up doing a random walk.
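
To back up the "not a floating point issue" bit, a toy sketch (PyTorch; the depth, width, and default init are arbitrary choices of mine): the input-layer gradient of a deep plain tanh stack comes out just as tiny in float64 as in float32, so extra precision buys you nothing.

    import torch
    import torch.nn as nn

    def input_grad_norm(depth=50, dim=64, dtype=torch.float32):
        # deep plain tanh stack, no residuals, default init
        torch.manual_seed(0)
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(dim, dim), nn.Tanh()]
        net = nn.Sequential(*layers).to(dtype)
        x = torch.randn(8, dim, dtype=dtype, requires_grad=True)
        net(x).sum().backward()
        return x.grad.norm().item()

    print(input_grad_norm(dtype=torch.float32))  # tiny
    print(input_grad_norm(dtype=torch.float64))  # still tiny: more precision doesn't help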

godelski 2 days ago | parent

Normalization doesn't fix the vanishing gradient problem. Like you point out, it doesn't go away with infinite precision, because that's not the problem. (I've got a PhD in this, btw)

D-Machine 2 days ago | parent

Yup, normalization doesn't fix vanishing gradients. It just generally helps with conditioning (as you likely know, given the PhD, probably in part because it keeps certain Lipschitz constants from getting too large), much like typical residual connections generally help with conditioning. But yeah, in that sense, normalization is more about exploding gradients, and residual connections are more about vanishing ones. A bit sloppy on my part.
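
For what it's worth, the "residuals are more about vanishing" half is easy to see numerically. Another toy sketch (PyTorch; depth, width, and init are again arbitrary choices, and input_grad_norm is just my helper name): the same deep tanh stack with an identity path added keeps the input-layer gradient orders of magnitude larger than the plain version.

    import torch
    import torch.nn as nn

    def input_grad_norm(depth=50, dim=64, residual=False):
        torch.manual_seed(0)
        blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
        )
        x = torch.randn(8, dim, requires_grad=True)
        h = x
        for block in blocks:
            # identical stack, with or without the skip connection
            h = h + block(h) if residual else block(h)
        h.sum().backward()
        return x.grad.norm().item()

    print(input_grad_norm(residual=False))  # plain stack: input gradient all but vanishes
    print(input_grad_norm(residual=True))   # with skip connections: orders of magnitude larger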