anvuong 31 minutes ago:

A funny thing is that in very high-dimensional spaces, with millions or billions of parameters, the chance of getting stuck in a local minimum is extremely small. Think about it like this: to be stuck at a local minimum in 2D, only 2 gradient components need to be zero; in higher dimensions, every single one of them, millions upon millions, must be zero at once. A single non-zero gradient component is enough for SGD to get you out. Now, SGD is a stochastic walk on that manifold, not entirely random but noisy, so the chance that you somehow walk into a local minimum is very, very low, unless it is a "really good" local minimum, in the sense that it dominates all the other local minima in its neighborhood.
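A minimal sketch of that escape argument, assuming a hypothetical toy loss with exactly one downward-curving direction (the function, dimension, and hyperparameters below are illustrative, not from the thread). Started exactly at the zero-gradient point, SGD's noise nudges the iterate off it, and that single direction is enough to carry it away:

    import numpy as np

    # Assumed toy loss: f(w) = -w[0]**2 + sum(w[1:]**2).
    # The gradient vanishes at the origin, but the first coordinate
    # curves downward, so the origin is a saddle, not a minimum.
    rng = np.random.default_rng(0)
    dim = 100_000  # stand-in for "millions of parameters"

    def grad(w):
        g = 2.0 * w
        g[0] = -2.0 * w[0]  # the one negative-curvature (escape) direction
        return g

    w = np.zeros(dim)       # start exactly at the critical point
    lr, noise = 0.01, 1e-3  # assumed hyperparameters

    for _ in range(1000):
        # model minibatch noise as isotropic Gaussian added to the gradient
        w -= lr * (grad(w) + noise * rng.standard_normal(dim))

    # |w[0]| grows geometrically once noise kicks it off zero, while the
    # remaining coordinates stay pinned near the bottom of their bowl.
    print(abs(w[0]), np.linalg.norm(w[1:]))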
appplication an hour ago:

Not a mathematician, so I’m immediately out of my depth here (and butchering terminology), but intuitively the presence of a massive number of local minima doesn’t seem relevant for gradient descent. A given local minimum would need a “well” at least as large as your step size to reasonably capture your descent. E.g., you could land perfectly on a local minimum, but you won’t stay there unless your step size is minute or the minimum is quite substantial.
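A minimal sketch of that step-size intuition, assuming a hypothetical 1-D loss (a broad parabola with shallow ripples riding on it; nothing here comes from the thread). The ripple wells repeat about every 0.16, so steps wider than that hop over them, while minute steps are captured by the first well they meet:

    import math

    # Assumed loss: a wide bowl with many shallow local minima on top.
    def f(x):
        return x**2 + 0.2 * math.sin(40 * x)

    def df(x):
        return 2 * x + 8 * math.cos(40 * x)

    def descend(x, lr, steps=2000):
        # plain gradient descent with a fixed step size
        for _ in range(steps):
            x -= lr * df(x)
        return x

    x0 = 3.0
    for lr in (0.05, 0.001):
        x = descend(x0, lr)
        print(f"lr={lr}: x={x:.3f}, f(x)={f(x):.3f}")

    # Expected behavior: lr=0.05 takes steps wider than a ripple's well,
    # so it skips the shallow minima and ends far down the broad bowl;
    # lr=0.001 is captured by the first shallow well it meets near x0.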