Avoids vanishing gradients in deeper networks.
Also, most blocks with a residual approximate the identity function when initialised, so tend to be well behaved.