| ▲ | lsy 3 days ago |
To make this more concrete: ImageNet enabled computer "vision" by providing images + labels, enabling the computer to take an image and spit out a label. LLM training sets enable text completion by providing text + completions, enabling the computer to take a piece of text and spit out its completion. Learning how the physical world works (not just kind of works, a la videogames, but actually works) is not only about a jillion times more complicated; there is really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed. "Spatial awareness" itself is kind of a simplification: the idea that you can be aware of space or 3D objects' behavior without the social context of what an "object" is or how it relates to your own physical existence. You could have two essentially identical objects that are nevertheless not interchangeable (the original Declaration of Independence vs. a copy, etc.), and there are many, many other borderline-philosophical questions about when one object becomes two, and so on.
|
| ▲ | m-s-y 2 days ago | parent | next [-] |
> the world itself, which cannot be compacted or fed into a computer at high speed.

…yet. 15 years ago, LLMs as they are today seemed like science fiction too.
|
| ▲ | sega_sai 2 days ago | parent | prev | next [-] |
I feel that if words/phrases/whole texts can be embedded well into high-dimensional spaces as points, the same must apply to the 3D world. I'm sure there will be embeddings of it (i.e., mapping a 3D scene to a high-D vector), and then we'll work with those embeddings the way LLMs work with text (disclaimer: I am not an expert in the field).
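
To illustrate the idea, here's a minimal sketch, assuming a PyTorch-style setup (the architecture and all names are invented for illustration, not an established standard): a PointNet-style encoder that maps a 3D scene, represented as a point cloud, to a single embedding vector.

```python
# Minimal sketch, assuming PyTorch: a PointNet-style encoder mapping a
# 3D scene (here an N x 3 point cloud) to one high-dimensional embedding.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Per-point MLP applied independently to every (x, y, z) point.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        feats = self.point_mlp(points)   # (batch, n_points, embed_dim)
        # Max-pool over points: an order-invariant summary of the scene.
        return feats.max(dim=1).values   # (batch, embed_dim)

encoder = SceneEncoder()
scene = torch.randn(1, 2048, 3)          # a fake 2048-point scene
embedding = encoder(scene)               # one 512-D vector per scene
```

Max-pooling over per-point features makes the embedding invariant to point ordering, which is one reason this family of encoders is popular for raw 3D data.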
|
| ▲ | amelius 2 days ago | parent | prev | next [-] |
Well, you can use one or two cameras and a lidar, and use that setup to generate training data for a depth-map model.
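
A hedged sketch of that idea (PyTorch-style; `model`, the shapes, and the function name are assumptions, not an established API): lidar returns projected into the camera frame provide sparse ground-truth depth, and the network is supervised only at pixels where a return landed.

```python
# Sketch: lidar gives sparse but accurate depth at the pixels where a
# return landed; the network learns dense depth from RGB against it.
import torch
import torch.nn.functional as F

def sparse_depth_loss(pred_depth, lidar_depth):
    """pred_depth, lidar_depth: (batch, H, W). lidar_depth is 0 at
    pixels with no lidar return, so those are masked out of the loss."""
    mask = lidar_depth > 0
    return F.l1_loss(pred_depth[mask], lidar_depth[mask])

# Hypothetical training step for some RGB -> depth `model`:
# pred = model(rgb)                           # (batch, H, W)
# loss = sparse_depth_loss(pred, lidar_depth)
# loss.backward(); optimizer.step()
```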
|
| ▲ | coldtea 3 days ago | parent | prev | next [-] |
> there is really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed.

Why wouldn't it be? If the world is ingested via video and lidar sensors, what's the hangup in recording that input and then replaying it faster?
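
As a toy illustration of the record-and-replay idea (the file layout and field names here are hypothetical): once the sensor streams are logged, playback is just a dataset read, bounded by I/O throughput rather than by the wall-clock rate of the original capture.

```python
# Toy sketch: "replaying" logged sensor streams is just reading a file,
# so it can run as fast as the disk allows, not at capture speed.
import numpy as np

def replay(log_path: str):
    """Yield recorded (timestamp, camera frame, lidar scan) tuples as
    fast as they can be read off disk."""
    with np.load(log_path) as log:
        for t, frame, scan in zip(log["timestamps"],
                                  log["camera_frames"],
                                  log["lidar_scans"]):
            yield t, frame, scan

# for t, frame, scan in replay("drive_001.npz"):
#     model.train_step(frame, scan)   # hours of capture, minutes of replay
```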
| ▲ | psb217 3 days ago | parent | next [-] |
I think there's an implicit assumption here that interaction with the world is critical for effective learning. In that case, you're bottlenecked by the speed of the world... when learning with a single agent. One neat thing about artificial computational agents, in contrast to natural biological agents, is that they can share the same brain and share lived experience, so the "speed of reality" bottleneck is much less of an issue.
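
A rough sketch of that "shared brain" pattern, loosely in the spirit of distributed RL setups such as Ape-X (all names here are illustrative): many embodied actors push experience into one buffer that a single learner samples from.

```python
# Many actors, one learner: the learner's training data accumulates at
# N times real time because N robots experience the world in parallel.
import random
from collections import deque

class SharedReplayBuffer:
    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):           # called by every robot/actor
        self.buffer.append(transition)

    def sample(self, batch_size: int):   # called by the single learner
        # Fine for a sketch; a real buffer would use a smarter structure.
        return random.sample(self.buffer, batch_size)

# buffer = SharedReplayBuffer()
# for robot in robots:            # N robots gathering experience at once
#     buffer.add(robot.step(policy))
# batch = buffer.sample(256)      # the learner sees N lifetimes at once
```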
| ▲ | HappMacDonald 2 days ago | parent | next [-] |
Yeah, I'm envisioning putting a thousand simplistic robotic "infants" into a vast "playpen" to gather sensor data about their environment, for some (probably smaller) number of deep learning models to ingest that input, guess at output strategies (move this servo, rotate this camshaft this far in that direction, etc.), and make predictions about the resulting changes to input. In principle a thousand different deep learning models could all train simultaneously on a thousand different robot experience feeds, though not 1-to-1 but 1-to-many: each neural net training on data from dozens or hundreds of the robots at the same time, with different neural nets sharing those feeds for their own rounds of training. Then of course all of the input data, paired with the outputs tested and the further inputs that serve as ground truth for the predictions, can be recorded for continued training sessions after the fact.
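
The many-to-many wiring described above could be set up along these lines (numbers and names purely illustrative):

```python
# Each of 1000 models subscribes to feeds from ~100 of the 1000 robots,
# so every robot's experience is shared by many models and every model
# sees many robots.
import random

n_robots, n_models, feeds_per_model = 1000, 1000, 100

subscriptions = {
    model_id: random.sample(range(n_robots), feeds_per_model)
    for model_id in range(n_models)
}

# Recorded feeds can also be replayed later, so the same experience
# keeps paying off in training sessions after the fact.
```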
| ▲ | csullivan107 2 days ago | parent | next [-] |
Never thought I'd get to do this, but this was my master's research! Simulations are inherently limited, and I just got tired of robotics research being done only in simulations. So I built a novel soft robot (notoriously difficult to control) and got it to learn by playing!! Here is an informal talk I gave on my work. Let me know if you want the thesis: https://www.youtube.com/live/ZXlQ3ppHi-E?si=MKcRqoxmEra7Zrt5
| ▲ | rybosome 2 days ago | parent | prev | next [-] |
A very interesting idea. I am curious about this sharing and blending of the various nets; I wonder if something as naive as averaging the weights (assuming the neural nets all have the same dimensions) would actually accomplish that?
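
For what it's worth, that naive approach is essentially federated averaging (FedAvg). A minimal sketch, assuming PyTorch, identical architectures, and floating-point parameters throughout:

```python
# Average every parameter tensor elementwise across the models.
import torch

def average_weights(models):
    """Return a state dict that is the elementwise mean of the models'."""
    state_dicts = [m.state_dict() for m in models]
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# merged = average_weights([net_a, net_b, net_c])
# net_a.load_state_dict(merged)   # every net can now load the blended brain
```

Whether the blended net keeps its parents' skills is the open question; in practice this kind of averaging behaves best when the models share an initialization and haven't diverged far, which is why federated setups average frequently rather than once at the end.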
| ▲ | loa_in_ 2 days ago | parent | prev [-] |
But the playpen will contain objects that are inherently breakable. You can't rough-handle the glass vessel and have it too.
| ▲ | HappMacDonald a day ago | parent | next [-] |
Basically everything applicable to the playpen of a human baby applies to the playpen of an AI robot baby in this setup, to at least some degree. Perhaps the least applicable part is that a robot hurting itself carries only the cost of replacing the broken part, vs. the potentially immeasurable cost of a human infant injuring itself. If it's not a good idea to put a glass vessel in a human crib (strictly from an "I don't want the glass vessel to be damaged" sense), then it's not a good idea to put one in the robot-infant crib either. Give them something less expensive to repair, like a stack of blocks instead. :P
| ▲ | m-s-y 2 days ago | parent | prev [-] |
The world is breakable. Any model based on it will need to know this anyway. Am I missing your argument?
| ▲ | hackyhacky 2 days ago | parent | prev [-] |
> In that case, you're bottlenecked by the speed of the world

Why not have the AI train on a simulation of the real world? We can build those pretty easily using traditional software and run them at any speed we want.
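
The speed point is easy to see in a toy example (a contrived free-fall integrator, purely illustrative): stepping a simulator is just computation, so simulated time is decoupled from wall-clock time.

```python
# Each simulated second costs only microseconds of wall-clock time, so
# an agent can experience years of "world" per day of compute.
def simulate(seconds: float, dt: float = 0.001):
    """Integrate `seconds` of 1-D free fall; returns (height, velocity)."""
    y, v, g = 100.0, 0.0, -9.81
    for _ in range(int(seconds / dt)):
        v += g * dt
        y += v * dt
    return y, v

print(simulate(60.0))   # one simulated minute in milliseconds of real time
```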
| ▲ | otodus 2 days ago | parent | prev [-] |
How would you handle olfactory and proprioceptive data?
|
|
| ▲ | TheOtherHobbes 2 days ago | parent | prev [-] |
Considering how bad LLMs are at understanding anything, and how they still manage to be useful, you simply don't need this level of complexity. You need something that mostly works most of the time, with guardrails so that when it makes mistakes nothing bad happens. Our brains acquire quite good heuristics for dealing with physical space without needing to experience all of physical reality. A cat-level or child-level understanding of physical space is more immediately useful than a philosopher's level of understanding.
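
As a toy illustration of the guardrails point (the limits and names are invented): wrap an imperfect policy so that even a badly wrong output is clipped to a safe envelope before it reaches the motors.

```python
# A policy that "mostly works" plus hard bounds on what it can command.
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def safe_act(policy, observation, max_speed=0.5):
    """Run the policy, then bound its proposed velocities so that even
    a wildly wrong output stays physically harmless."""
    vx, vy = policy(observation)   # may occasionally be badly wrong
    return (clamp(vx, -max_speed, max_speed),
            clamp(vy, -max_speed, max_speed))
```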