ks2048 4 days ago

When you look at a 2D surface, you directly observe all the values on that surface.

For a loss function, the value at each point must be computed.

You can compute them all and "look at" the surface and just directly choose the lowest - that is called a grid search.
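
A minimal sketch of what that looks like in 2D, assuming a toy loss. The names `grid_search` and `bowl` are made up for illustration and do not come from any library:

```python
import itertools

def grid_search(loss, bounds, resolution):
    """Evaluate loss at every point of a regular grid and return the best one."""
    # One evenly spaced axis per dimension, endpoints included.
    axes = [
        [lo + (hi - lo) * i / (resolution - 1) for i in range(resolution)]
        for lo, hi in bounds
    ]
    # itertools.product enumerates all resolution ** len(bounds) grid points.
    best_point = min(itertools.product(*axes), key=loss)
    return best_point, loss(best_point)

def bowl(p):
    # Hypothetical 2D loss with its minimum at (1, -2).
    x, y = p
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

point, value = grid_search(bowl, [(-5.0, 5.0), (-5.0, 5.0)], resolution=11)
print(point, value)
```

With 11 samples per axis this is only 11^2 = 121 evaluations, which is why it is painless in 2D.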

For high dimensions, there are just way too many "points" to compute.

samsartor 4 days ago | parent | next [-]

And remember, optimization problems can be _incredibly_ high-dimensional. A 7B-parameter LLM is a 7-billion-dimensional optimization landscape. A grid search with a resolution of 10 (i.e. 10 samples per dimension) would require evaluating the loss function 10^(7*10^9) times. That is, the number of evaluations is a number with about 7 billion digits.
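
A quick way to check the digit count without ever building the number, as a rough sketch (`grid_evaluations_digits` is a name invented here, and the floating-point logarithm means it is only trustworthy when the product is not right at an integer boundary):

```python
import math

def grid_evaluations_digits(n_params, resolution):
    """Decimal digits in resolution ** n_params, computed via logarithms."""
    # A positive integer N has floor(log10(N)) + 1 decimal digits,
    # and log10(resolution ** n_params) = n_params * log10(resolution).
    return math.floor(n_params * math.log10(resolution)) + 1

digits = grid_evaluations_digits(7_000_000_000, 10)
print(digits)
```

The result is 7 billion plus one, since 10^k is a 1 followed by k zeros.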

cubefox 3 days ago | parent | prev [-]

What about sampling at low resolution? If the hills and valleys aren't too close together, this should give a good indication of where the global minimum is.

xigoi 3 days ago | parent [-]

> If the hills and valleys aren't too close together

That’s a big “if”.

cubefox 3 days ago | parent [-]

At least it will catch those valleys that are wider than the sampling resolution.
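
A small 1D sketch of that trade-off, assuming a made-up loss with one wide shallow valley and one narrow deep one (`two_valleys` and `argmin_on_grid` are hypothetical names for illustration):

```python
import math

def two_valleys(x):
    # Wide, shallow valley centred at x = 2 (minimum value 0) ...
    wide = 0.1 * (x - 2.0) ** 2
    # ... plus a narrow, deep dip of width ~0.05 centred at x = -3.3.
    narrow = -5.0 * math.exp(-((x + 3.3) / 0.05) ** 2)
    return wide + narrow

def argmin_on_grid(f, lo, hi, n):
    """Return the grid point with the lowest f among n evenly spaced samples."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(xs, key=f)

coarse = argmin_on_grid(two_valleys, -5.0, 5.0, 11)     # spacing 1.0
fine = argmin_on_grid(two_valleys, -5.0, 5.0, 10001)    # spacing 0.001
print(coarse, fine)
```

The coarse grid lands in the wide valley near x = 2 and never sees the deeper dip near x = -3.3, because that dip is much narrower than the sample spacing.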

xigoi 3 days ago | parent [-]

Yeah. The problem is that the number of samples needed is exponential in the dimension, so in a 1000-dimensional space, you won’t even be able to subdivide it into 2×…×2.
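
To put a number on it, here is a rough sketch using Python's exact integers. The budget of 10^30 evaluations is an arbitrary, deliberately generous assumption, not a real figure:

```python
dims = 1000
samples_per_axis = 2

# Total grid points for a 2 x ... x 2 subdivision in 1000 dimensions.
grid_points = samples_per_axis ** dims
digit_count = len(str(grid_points))

# Hypothetical, very generous evaluation budget (an assumption for illustration).
budget = 10 ** 30

# Largest dimension whose 2 x ... x 2 grid still fits in that budget.
max_dims = 0
while samples_per_axis ** (max_dims + 1) <= budget:
    max_dims += 1

print(digit_count, max_dims)
```

Even the coarsest possible grid has 2^1000 points, a 302-digit number, while the assumed budget runs out at 99 dimensions.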

cubefox 3 days ago | parent [-]

Damn.