I can look around me and find the minimum of anything without tracing its surface and following the gradient. I can also identify immediately global minima instead of local ones.

We all can do it in 2-3D. But our algorithms don’t do it. Even in 2D.

Sure if I was blindfolded, feeling the surface and looking for minimization direction would be the way to go. But when I see, I don’t have to.

What are we missing?

▲

ks2048 4 days ago | parent | next [-]

When you look at a 2D surface, you directly observe all the values on that surface.

For a loss-function, the value at each point must be computed.

You can compute them all and "look at" the surface and just directly choose the lowest - that is called a grid search.

For high dimensions, there's just way too many "points" to compute.

▲

samsartor 4 days ago | parent | next [-]

And remember, optimization problems can be _incredibly_ high-dimensional. A 7B parameter LLM is a 7-billion-dimensional optimization landscape. A grid-search with a resolution of 10 (ie 10 samples for each dimension) would requre evaluating the loss function 10^(7*10^9) times. That is, the number of evaluations is a number with 7B digits.

▲

cubefox 3 days ago | parent | prev [-]

What about sampling at low resolution? If the hills and valleys aren't too close together, this should give a good indication of where the global minimum is.

▲

xigoi 3 days ago | parent [-]

> If the hills and valleys aren't too close together

That’s a big “if”.

▲

cubefox 3 days ago | parent [-]

At least it will catch those valleys that are wider than the sampling resolution.

▲

xigoi 3 days ago | parent [-]

Yeah. The problem is that the number of samples needed is exponential in the dimension, so in a 1000-dimensional space, you won’t even be able to subdivide it into 2×…×2.

	▲	cubefox 3 days ago \| parent [-]
		Damn.

▲

Chinjut 4 days ago | parent | prev | next [-]

You're thinking of situations where you are able to see a whole object at once. If you were dealing with an object too large to see all of, you'd have to start making decisions about how to explore it.

	▲	3eb7988a1663 4 days ago \| parent [-]
		The mental image I like: imagine you are lost in a hilly region with incredibly dense fog such that you can only see one foot directly in front of you. How do you find the base of the valley? Gradient descent: take a step in the steepest downward direction. Look around and repeat. When you reach a level area, how do you know you are at the lowest point?

▲

jpeloquin 4 days ago | parent | prev | next [-]

Evaluating a function using a densely spaced grid and plotting it does work. This is brute-force search. You will see the global minima immediately in the way you describe, provided your grid is dense enough to capture all local variation.

It's just that when the function is implemented on the computer, evaluating so many points takes a long time, and using a more sophisticated optimization algorithm that exploits information like the gradient is almost always faster. In physical reality all the points already exist, so if they can be observed cheaply the brute force approach works well.

Edit: Your question was good. Asking superficially-naive questions like that is often a fruitful starting point for coming up with new tricks to solve seemingly-intractable problems.

	▲	whatever1 4 days ago \| parent [-]
		Thanks! It does feels to me that we do some sort of sampling, definitely is not a naive grid search. Also I find it easier to find the minima in specific directions (up, down, left, right) rather than let’s say a 42 degree one. So some sort of priors are probably used to improve sample efficiency.

▲

nwallin 4 days ago | parent | prev | next [-]

When you look at, for instance, a bowl, or even one of those egg carton mattress things, and you want to find the global minimum, you are looking at a surface which is 2 dimensions in and 1 dimension out. It's easy enough for your brain to process several thousand points and say ok the bottom of the bowl is right here.

When a computer has a surface which is 2 dimensions in and 1 dimension out, you can actually just do the same thing. Check like 100 values in the x/y directions and you only have to check like 10000 values. A computer can do that easy peasy.

When a computer does ML with a deep neural network, you don't have 2 dimensions in and 1 dimension out. You have thousands to millions of dimensions in and thousands to millions of dimensions out. If you have 100000 inputs, and you check 1000 values for each input, the total number of combinations is 1000^100000. Then remember that you also have 100000 outputs. You ain't doin' that much math. You ain't.

So we need fancy stuff like Jacobians and backtracking.

▲

whatever1 4 days ago | parent [-]

I don’t think it’s that simple. For the egg carton your eye will not spend almost any time looking at its top. You will spend most of the time sampling the bottom. I don’t know what we do, but it does not feel like a naive grid search.

	▲	cvoss 4 days ago \| parent [-]
		I really don't think you have the ability to use self-reflection to discern an algorithm that occurs in your unconscious visual cortex in a split second. You wouldn't feel like you were doing a naive grid search even if a naive grid search is exactly what you were doing. You have suggested that the process in your mind to find a global minimum is immediate, apparently to contrast this with a standard computational algorithm. But such comparison fails. I don't know whether you mean "with few computational steps" or "in very little time"; the former is not knowable to you; the latter is not relevant since the hardware is not the same.

▲

zoogeny 4 days ago | parent | prev | next [-]

People here are giving you mathematical answers which is what you are asking for, but I want to challenge your intuition here.

In construction, grading a site for building is a whole process involving surveying. If you dropped a person on a random patch of earth that hasn't previously been levelled and gave them no tools, it would be a significant challenge for that person to level the ground correctly.

What I'm saying is, your intuition that "I can look around me and find the minimum of anything" is almost certainly wrong, unless you have a superpower that no other person has.

	▲	whatever1 4 days ago \| parent [-]
		That is true we are only good at doing it for specific directions of the objective function. The one that we perceive as the minimizing direction. If you tell me find the minimum with a direction of 53 degrees likely I will fail, because I can’t easily visualize where this direction points towards

▲

shoo 3 days ago | parent | prev | next [-]

Many practical optimisation problems are less like "let's go hiking and climb a literal hill which we can see in front of us" and more like "find the best design in this space of possible designs that maximises some objective"

Here are some alternative example problems, that are a lot more high dimensional, and also where the dimensions are not spatial dimensions so your eyes give you absolutely no benefit.

(a) Your objective is to find a recipe that produces a maximally tasty meal, using the ingredients you have in your kitchen cupboard. To sample one point in recipe-space, you need to (1) devise a recipe, (2) prep and cook a candidate meal following the recipe, and (3) evaluate the candidate recipe, say by serving it to a bunch of your friends and family. That gets you one sample point. Maybe there are 1 trillion possible "recipes" you could make. Are you going to brute-force cook and serve them all to find a meal that maximises tastiness, or is there a more efficient way that requires fewer plan recipe->prep&cook->serve->evaluate cycles?

(b) Your objective is to find the most efficient design of a bridge, that can support the required load and stresses, while minimising the construction cost.

▲

GuB-42 4 days ago | parent | prev | next [-]

Your eyes compute gradients, as part of the shitton of visual processing your brain does to get an estimate of where the local and global minima are.

It is not perfect though, see the many optical illusions.

But we follow gradients all the time, consciously or not. You know you are at the bottom of the hole when all the paths go up for instance.

	▲	dcanelhas 3 days ago \| parent [-]
		It has been suggested [citation needed] that the optical illusions of movement caused by gradients are there to to compensate for the time it takes to process the visual input. This should let you have an understanding of what is going on in the world around you right now, based on what happened on your retinas a few milliseconds ago. it's not a bug - it's a feature :D

▲

i_am_proteus 4 days ago | parent | prev | next [-]

Without looking up the answer (because someone has already computed this for you), how would you find the highest geographic point (highest elevation) in your country?

▲

raffael_de 4 days ago | parent | prev | next [-]

well, first of all ... you can't. and it is very easy to come up with all sorts of (not even special) cases where you simply couldn't for literally obvious reasons. what you are imagining is some sort of stereoscopic ray tracing. that is anyway much more compute intensive then calculating a derivative.

▲

adrianN 4 days ago | parent | prev | next [-]

The inputs you can process visually are of trivial size even for naive algorithms, and probably also simple instances. I certainly can’t find global minima in 2d for any even slightly adversarial function.

▲

cinntaile 4 days ago | parent | prev | next [-]

What if you're trying to find the minimum of something that you can't see? Or what if the differences are so small that you can't perceive them with your eyes even though you can see?

▲

hackinthebochs 4 days ago | parent | prev | next [-]

You're ignoring all the calculations that go on unconsciously that realize your conscious experience of "immediately" apprehending the global minima.

▲

fancyfredbot 4 days ago | parent | prev | next [-]

Your visual cortex is a massively parallel processor.

▲

pestatije 4 days ago | parent | prev [-]

touch and sight sense essentially the same...the difference is in the magnitudes involved