jandrewrogers 3 days ago

I appreciate the video and generally agree with Fei-Fei but I think it almost understates how different the problem of reasoning about the physical world actually is.

Most dynamics of the physical world are sparse, non-linear systems at every level of resolution. Most ways of constructing accurate models mathematically don’t actually work. LLMs, for better or worse, are pretty classic (in an algorithmic information theory sense) sequential induction problems. We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch.

There are a bunch of fundamental computer science problems that stand in the way, which I was schooled on in 2006 from the brightest minds in the field. For example, how do you represent arbitrary spatial relationships on computers in a general and scalable way? There are no solutions in the public data structures and algorithms literature. We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

I vigorously agree that the ability to reason about spatiotemporal dynamics is critical to general AI. But the computer science required is so different from classical AI research that I don’t expect any pure AI researcher to bridge that gap. The other aspect is that this area of research became highly developed over two decades but is not in the public literature.

One of the big questions I have had since they announced the company, is who on their team is an expert in the dark state-of-the-art computer science with respect to working around these particular problems? They risk running straight into the same deep, layered theory walls that almost everyone else has run into. I can’t identify anyone on the team that is an expert in a relevant area of computer science theory, which makes me skeptical to some extent. It is a nice idea but I don’t get the sense they understand the true nature of the problem.

Nonetheless, I agree that it is important!

teemur 3 days ago | parent | next [-]

> We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

This made me a bit curious. Would you have any pointers to books/articles/search terms if one wanted to have a bit deeper look on this problem space and where we are?

jandrewrogers 2 days ago | parent [-]

I'm not aware of any convenient literature but it is relatively obvious once someone explains it to you (as it was explained to me).

At its root it is a cutting problem, like graph cutting but much more general because it includes things like non-trivial geometric types and relationships. Solving the cutting problem is necessary to efficiently shard/parallelize operations over the data models.

For classic scalar data models, representations that preserve the relationships have the same dimensionality as the underlying data model. A set of points in 2-dimensions can always be represented in 2-dimensions such that they satisfy the cutting problem (e.g. a quadtree-like representation).
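The quadtree example can be made concrete: the cells live in the same two dimensions as the points, and each leaf is a spatially coherent shard. A minimal sketch with illustrative names (`build_quadtree` is hypothetical, not from any real system):

```python
# Toy quadtree: recursively split 2D points into spatially coherent
# shards. The partition has the same dimensionality as the data, which
# is the property the parent comment is pointing at.

def build_quadtree(points, bounds, capacity=4):
    """Partition `points` within `bounds` = (xmin, ymin, xmax, ymax).
    Leaves hold at most `capacity` points, so each leaf is a shard
    that preserves 2D locality."""
    if len(points) <= capacity:
        return {"leaf": True, "points": points, "bounds": bounds}
    xmin, ymin, xmax, ymax = bounds
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    quads = [
        (xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
        (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax),
    ]
    children = []
    for bx0, by0, bx1, by1 in quads:
        # Half-open intervals so each interior point lands in one quad.
        sub = [p for p in points if bx0 <= p[0] < bx1 and by0 <= p[1] < by1]
        children.append(build_quadtree(sub, (bx0, by0, bx1, by1), capacity))
    return {"leaf": False, "children": children, "bounds": bounds}

def leaves(node):
    """Yield all leaf shards of the tree."""
    if node["leaf"]:
        yield node
    else:
        for child in node["children"]:
            yield from leaves(child)
```

Every point ends up in exactly one leaf, and nearby points tend to share a leaf, which is what makes the cut useful for parallelism.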

For non-scalar types like rectangles, operations like equality and intersection are distinct and there are an unbounded number of relationships that must be preserved that touch on concepts like size and aspect ratio to satisfy cutting requirements. The only way to expose these additional relationships to cutting algorithms is to encode and embed these other relationships in a (much) higher dimensionality space and then cut that space instead.
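A minimal sketch of the embedding idea: an axis-aligned rectangle becomes a single point in a higher-dimensional "corner space", where a relationship like intersection turns into half-space tests that point-style partitioning can see. Illustrative only; this is not the unpublished technique the comment describes, and the extra size/aspect coordinates are just examples of relationships one might want to preserve:

```python
# A rectangle is not a point: embedding it as one point in a higher-
# dimensional space exposes relationships like intersection, size, and
# aspect ratio to point-style cutting algorithms.

from typing import NamedTuple

class Rect(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def embed(r: Rect):
    """Embed a 2D rectangle as a point in 4D corner space, plus derived
    coordinates (area, aspect ratio) a cutting heuristic might also
    want to preserve."""
    w, h = r.xmax - r.xmin, r.ymax - r.ymin
    return (r.xmin, r.ymin, r.xmax, r.ymax, w * h, w / h)

def intersects(a: Rect, b: Rect) -> bool:
    """Intersection becomes a half-space test in corner space: each min
    coordinate of one rectangle must not exceed the other's max."""
    return a.xmin <= b.xmax and b.xmin <= a.xmax \
       and a.ymin <= b.ymax and b.ymin <= a.ymax
```

The hard part the comment alludes to is not this embedding itself but cutting the resulting high-dimensional space well when rectangles overlap heavily.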

The mathematically general case isn't computable but real-world data models don't need it to be. Several decades ago it was determined that if you constrain the properties of the data model tightly enough then it should be possible to systematically construct a finite high-dimensionality embedding for that data model such that it satisfies the cutting problem.

Unfortunately, the "should be possible" understates the difficulty. There is no computer science literature for how one might go about constructing these cuttable embeddings, not even for a narrow subset of practical cases. The activity is also primarily one of designing data structures and algorithms that can represent complex relationships among objects with shape and size in dimensions much greater than three, which is cognitively difficult. Many smart people have tried and failed over the years. It has a lot of subtlety and you need practical implementations to have good properties as software.

About 20 years ago, long before "big data", the iPhone, or any current software fashion, this and several related problems were the subject of an ambitious government research program. It was technically successful, demonstrably. That program was killed in the early 2010s for unrelated reasons and much of that research was semi-lost. It was so far ahead of its time that few people saw the utility of it. There are still people around that were either directly involved or learned the computer science second-hand from someone that was but there aren't that many left.

calf 2 days ago | parent | next [-]

But then that sounds more like that person explained it wrong. They didn't explain why it is necessary to reduce to GRAPHCUT, it seems to me to beg the question. We should not assume this is true based on some vague anthropomorphic appeal to spatial locality, surely?

jandrewrogers 2 days ago | parent [-]

It isn’t a graph cutting problem, graph cutting is just a simpler, special case of this more general cutting problem (h/t IBM Research). If you can solve the general problem you effectively get efficient graph cutting for free. This is obviously attractive to the extent you can do both complex spatial and graph computation at scale on the same data structure instead of specializing for one or the other.

The challenge with cutting e.g. rectangles into uniform subsets is that logical shard assignment must be identical regardless of insertion order and in the absence of an ordering function, with O(1) space complexity and without loss of selectivity. Arbitrary sets of rectangles overlap, sometimes heavily, which is the source of most difficulty.

Of course, with practical implementations write scalability matters and incremental construction is desirable.

calf a day ago | parent [-]

Well, previously you said that it (presumably "it" broadly refers to spatial reasoning AI) is a "high dimensional complex type cutting problem".

You said this is obvious once explained. I don't see this as obvious, rather, I see this as begging the question--the research program you were secretly involved in wanted to parallelize the engineering of it so obviously they needed some fancy "cutting algorithm" to make it possible.

The problem is that this conflated the scientific statement of what "spatial reasoning" is. There's no obvious explanation why spatial reasoning should intuitively be some kind of cutting problem however you wish to define or generalize a cutting problem. That's not how good CS research is done or explained.

In fact I could (mimicking your broad assertions) go so far as to claim, the project was doomed to fail because they weren't really trying to understand something, they want to make something without understanding it as the priority. So they were constrained by the parallel technology that they had at the time, and when the computational power available didn't pan out they reached a natural dead end.

andoando 2 days ago | parent | prev | next [-]

I've spent years trying to tackle spatial representations on my own, so I'm extremely curious here.

How does the cutting problem relate to intelligence in the first place?

jandrewrogers 2 days ago | parent [-]

Indexing is a special case of AI. At the limit, optimal cutting and learning are equivalent problems. Non-trivial spatial representations push these two things much closer together than is normally desirable for e.g. indexing algorithms. Tractability becomes a real issue.

Practically, scalable indexing of complex spatial relationships requires what is essentially a type of learned indexing, albeit not neural network based.

RaftPeople 6 hours ago | parent [-]

> is essentially a type of learned indexing, albeit not neural network based.

NNs are just function approximation; why do you think they could not be a valuable part of the solution?

It seems like a dynamically adjusted/learned function approximator is a good general tool for most of these hard problems.

tehjoker 2 days ago | parent | prev | next [-]

Did that research program have a public code name?

mindcrime 2 days ago | parent | next [-]

Looking through some old DARPA budget docs[1], it seems like there's a chance that what's being discussed here falls under DARPA's "PE 0602702E TACTICAL TECHNOLOGY" initiative, project TT-06.

Some other possibilities might include:

  - "PE 0602304E COGNITIVE COMPUTING SYSTEMS", project COG-02.
  - "PE 0602716E ELECTRONICS TECHNOLOGY", project ELT-01
  - "PE 0603760E COMMAND, CONTROL AND COMMUNICATIONS SYSTEMS", project CCC-02
  - "PE 0603766E NETWORK-CENTRIC WARFARE TECHNOLOGY", project NET-01
  - "PE 0603767E SENSOR TECHNOLOGY", project SEN-02
Or maybe it's nothing to do with this at all. But in either case, this looks like some interesting stuff to explore in its own right. :-)

[1]: https://web.archive.org/web/20181001000000/https://www.darpa...

jandrewrogers 2 days ago | parent | prev | next [-]

Not that I know of. If I drop the program director’s name, people that know, know. That is all the handshake you usually need.

sho 2 days ago | parent | prev [-]

Sounds like Genoa/Topsail

jedharris 2 days ago | parent | prev [-]

some pointers to the research program please?

jandrewrogers 2 days ago | parent [-]

It was a national security program with no public face. I was recruited into it because I solved a fundamental computer science problem they were deeply interested in. I did not get my extensive supercomputing experience in academia. It was a great experience if you just wanted to do hardcore computer science research, which at the time I did.

There are several VCs with knowledge of the program. It is obscure but has cred with people that know about it. I’ve raised millions of dollars off the back of my involvement.

A lot of really cool computer science research has happened inside the government. I think it is a bit less these days but people still underestimate it.

DennisP a day ago | parent [-]

I'm not surprised that the government does great research, but I wonder how much good that research does if it's unpublished and disappears after budget cuts.

lsy 3 days ago | parent | prev | next [-]

To make this more concrete: ImageNet enabled computer "vision" by providing images + labels, enabling the computer to take an image and spit out a label. LLM training sets enable text completion by providing text + completions, enabling the computer to take a piece of text and spit out its completion. Learning how the physical world works (not just kind of works, a la videogames, but actually works) is not only about a jillion times more complicated, but there is also really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed.

"Spatial awareness" itself is kind of a simplification: the idea that you can be aware of space or 3d objects' behavior without the social context of what an "object" is or how it relates to your own physical existence. Like you could have two essentially identical objects but they are not interchangeable (original Declaration of Independence vs a copy, etc). And many many other borderline-philosophical questions about when an object becomes two, etc.

m-s-y 2 days ago | parent | next [-]

> the world itself, which cannot be compacted or fed into a computer at high speed.

…yet.

15 years ago LLMs as they are today seemed like science fiction too.

awakeasleep 2 days ago | parent | next [-]

Yes! It only requires a few fundamental breakthroughs in areas that seem constrained by physical reality.


sega_sai 2 days ago | parent | prev | next [-]

I feel that if words/phrases/whole texts can be embedded well in high-dimensional spaces as points, the same must apply to the 3D world. I'm sure there will be embeddings of it (i.e. mapping the 3D scene into a high-D vector) and then we'll work with those embeddings as LLMs work with text (disclaimer: I am not an expert in the field).

amelius 2 days ago | parent | prev | next [-]

Well, you can use one or two cameras and a lidar, and use that to generate data to train a depth-map.
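The suggestion can be sketched concretely: lidar gives sparse but accurate depth at a few pixels, and those pixels supervise a dense depth predictor from the camera image. A minimal numpy illustration of the masked regression loss at the heart of such a setup (names hypothetical, not any specific pipeline):

```python
# Sketch of sparse-lidar supervision for a dense depth map: only the
# pixels where the lidar actually returned a measurement contribute
# to the training loss.

import numpy as np

def masked_depth_loss(pred_depth, lidar_depth, lidar_mask):
    """Mean absolute error over only the pixels where `lidar_mask`
    is True (i.e. where the lidar measured a depth)."""
    diff = np.abs(pred_depth - lidar_depth)
    return float(diff[lidar_mask].mean())
```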

coldtea 3 days ago | parent | prev | next [-]

>there is really only one usable dataset: the world itself, which cannot be compacted or fed into a computer at high speed.

Why wouldn't it be? If the world is ingressed via video sensors and lidar sensor, what's the hangup in recording such input and then replaying it faster?

psb217 3 days ago | parent | next [-]

I think there's an implicit assumption here that interaction with the world is critical for effective learning. In that case, you're bottlenecked by the speed of the world... when learning with a single agent. One neat thing about artificial computational agents, in contrast to natural biological agents, is that they can share the same brain and share lived experience, so the "speed of reality" bottleneck is much less of an issue.

HappMacDonald 2 days ago | parent | next [-]

Yeah I'm envisioning putting a thousand simplistic robotic "infants" into a vast "playpen" to gather sensor data about their environment, for some (probably smaller) number of deep learning models to ingest the input and guess at output strategies (move this servo, rotate this camshaft this far in that direction, etc) and make predictions about resulting changes to input.

In principle a thousand different deep learning models could all train simultaneously on a thousand different robot experience feeds, though not 1-to-1 but 1-to-many: each neural net training on data from dozens or hundreds of the robots at the same time, and different neural nets sharing those feeds for their own rounds of training.

Then of course all of the input data paired with outputs tested and further inputs as ground truth to predictions can be recorded for continued training sessions after the fact.
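The many-robots-to-many-learners fan-out described above can be sketched as a shared experience buffer; this is a single-process toy with illustrative names, not a real distributed system:

```python
# Every robot appends (observation, action, outcome) transitions to one
# shared buffer; each learner samples its own overlapping batches, so
# one robot's experience trains many models.

import random

class SharedExperienceBuffer:
    def __init__(self):
        self.episodes = []

    def record(self, robot_id, transition):
        """A robot contributes one transition to the shared pool."""
        self.episodes.append((robot_id, transition))

    def sample(self, k, rng=None):
        """Each learner draws its own random batch without replacement;
        different learners' batches overlap freely."""
        rng = rng or random
        return rng.sample(self.episodes, k)
```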

csullivan107 2 days ago | parent | next [-]

Never thought I’d get to do this but this was my masters research! Simulations are inherently limited and I just got tired of robotic research being done only in simulations. So I built a novel soft robot (notoriously difficult to control) and got it to learn by playing!!

Here is an informal talk I gave on my work. Let me know if you want the thesis

https://www.youtube.com/live/ZXlQ3ppHi-E?si=MKcRqoxmEra7Zrt5

rybosome 2 days ago | parent | prev | next [-]

A very interesting idea. I am curious about this sharing and blending of the various nets; I wonder if something as naive as averaging the weights (assuming the neural nets all have the same dimensions) would actually accomplish that?
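The naive averaging idea is easy to state in code; it is essentially the FedAvg update from federated learning, and whether it preserves behavior depends heavily on how aligned the nets' loss basins are. A toy sketch, assuming identical architectures:

```python
# Elementwise weight averaging across models that share an
# architecture (one weight dict per agent/robot). This is the FedAvg
# aggregation step; it only "merges" behavior under strong conditions.

import numpy as np

def average_weights(models):
    """Elementwise mean of a list of weight dicts with identical keys
    and shapes."""
    keys = models[0].keys()
    return {k: np.mean([m[k] for m in models], axis=0) for k in keys}
```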

loa_in_ 2 days ago | parent | prev [-]

But the playpen will contain objects that are inherently breakable. You cannot rough handle the glass vessel and have it too.

HappMacDonald a day ago | parent | next [-]

Basically everything applicable to the playpen of a human baby is applicable to the playpen of an AI robot baby in this setup, to at least some degree.

Perhaps the least applicable part is that "robot hurting itself" has the liability of some cost to replace the broken robot part, vs the potentially immeasurable cost of a human infant injuring themselves.

If it's not a good idea to put a "glass vessel" in a human crib (strictly from an "I don't want the glass vessel to be damaged" sense) then it's not a good idea to put that in the robot-infant crib either.

Give them something less expensive to repair, like a stack of blocks instead. :P

m-s-y 2 days ago | parent | prev [-]

The world is breakable. Any model based on it will need to know this anyway. Am I missing your argument?

devenson 2 days ago | parent [-]

Can't reset state after breakage.

hackyhacky 2 days ago | parent | prev [-]

> In that case, you're bottlenecked by the speed of the world

Why not have the AI train on a simulation of the real world? We can build those pretty easily using traditional software and run them at any speed we want.

otodus 2 days ago | parent | prev [-]

How would you handle olfactory and proprioceptive data?

TheOtherHobbes 2 days ago | parent | prev [-]

Considering how bad LLMs are at understanding anything, and how they still manage to be useful, you simply don't need this level of complexity.

You need something that mostly works most of the time, and has guardrails so when it makes mistakes nothing bad happens.

Our brains acquire quite good heuristics for dealing with physical space without needing to experience all of physical reality.

A cat-level or child-level understanding of physical space is more immediately useful than a philosopher-level of understanding.

machinelearning 3 days ago | parent | prev | next [-]

> Most ways of constructing accurate models mathematically don’t actually work

This is true for almost anything at the limit; we are already able to model spatiotemporal dynamics to some useful degree (see: progress in VLAs, video diffusion, 4D Gaussians).

> We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch

What's the source for the claim that this is a physically impossible problem? I'm not sure what you mean by "impedance mismatch"; do you mean it is unsolvable even with better techniques?

Your whole third paragraph could have been said about LLMs and isn't specific enough, so we'll skip that.

I don't really understand the other two paragraphs. What is this "dark state-of-the-art computer science" you speak of? What is this "area of research [that] became highly developed over two decades but is not in the public literature"? How is "the computer science required ... so different from classical AI research"?

calf 2 days ago | parent [-]

Above commenter also asserts "highly developed research but no public literature" shrug ...

jandrewrogers 2 days ago | parent | next [-]

It was a national security program that plenty of people are familiar with and has been used across several countries. None of those programs publish.

As much as the literature doesn’t exist, the tech has been used in production for over a decade. That’s just my word of course but a lot of people know. :shrug:

calf a day ago | parent [-]

This is as rhetorically valid as claiming to know about UFOs; it smells of crackpot science.

fu-hn 2 days ago | parent | prev [-]

But the best minds in the world said so!

dopadelic 2 days ago | parent | prev | next [-]

You're pointing out a real class of hard problems — modeling sparse, nonlinear, spatiotemporal systems — but there’s a fundamental mischaracterization in lumping all transformer-based models under “LLMs” and using that to dismiss the possibility of spatial reasoning.

Yes, classic LLMs (like GPT) operate as sequence predictors with no inductive bias for space, causality, or continuity. They're optimized for language fluency, not physical grounding. But multimodal models like ViT, Flamingo, and Perceiver IO are a completely different lineage, even if they use transformers under the hood. They tokenize images (or video, or point clouds) into spatially-aware embeddings and preserve positional structure in ways that make them far more suited to spatial reasoning than pure text LLMs.

The supposed “impedance mismatch” is real for language-only models, but that’s not the frontier anymore. The field has already moved into architectures that integrate vision, text, and action. Look at Flamingo's vision-language fusion, or GPT-4o’s real-time audio-visual grounding — these are not mere LLMs with pictures bolted on. These are spatiotemporal attention systems with architectural mechanisms for cross-modal alignment.

You're also asserting that "no general-purpose representations of space exist" — but this neglects decades of work in computational geometry, graphics, physics engines, and more recently, neural fields and geometric deep learning. Sure, no universal solution exists (nor should we expect one), but practical approximations exist: voxel grids, implicit neural representations, object-centric scene graphs, graph neural networks, etc. These aren't perfect, but dismissing them as non-existent isn’t accurate.

Finally, your concern about who on the team understands these deep theoretical issues is valid. But the fact is: theoretical CS isn’t the bottleneck here — it’s scalable implementation, multimodal pretraining, and architectural experimentation. If anything, what we need isn’t more Solomonoff-style induction or clever data structures — it’s models grounded in perception and action.

The real mistake isn’t that people are trying to cram physical reasoning into LLMs. The mistake is in acting like all transformer models are LLMs, and ignoring the very active (and promising) space of multimodal models that already tackle spatial, embodied, and dynamical reasoning problems — albeit imperfectly.

mumbisChungo 2 days ago | parent | next [-]

Claude, is that you?

calf 2 days ago | parent | prev [-]

How do we prove a trained LLM has no inductive bias for space, causality, etc.? We can't assume this is true by construction, can we?

dopadelic 2 days ago | parent [-]

Why would we need to prove such a thing? Human vision has strong inductive biases, which is why you can perceive objects in abstract patterns. This is why you can lie down in a park and see a duck in a cloud. It's also why we can create abstracted representations of things with graphics. Having inductive biases makes it more relatable to the way we work.

And again, you're using the term LLMs again when vision based transformers in multimodal models aren't simply LLMs.

calf a day ago | parent [-]

You said that classic LLMs have no inductive bias for causality. So I am simply asking if any computer scientist has actually proved that. Otherwise it is just a fancy way of saying "LLMs can't reason, they are just stochastic parrots". AFAIK not every computer scientist shares that consensus. So to use that claim is to potentially smuggle in an assumption that is not scientifically settled. That's why I specifically asked about this claim which you made a few paragraphs into your response to the parent commenter.

ccozan 3 days ago | parent | prev | next [-]

I agree that the problem is hard. However, the biological brain is able to handle it quite "easily" (it is not really easy; billions of iterations were needed). Current brains solve this 3D physical world _only_ via perception.

So this is the place where we must look. It starts with sensing and the integration of that sensing. I have been working on this problem for more than 10 years and have reached some results. I am not a real scientist but a true engineer, and I am looking at it from that perspective quite intensely. The question one must ask is: how do you define the outside physical world from the perspective of a biological sensing "device"? What exactly are we "seeing" or "hearing"? So yes, working on that brought me further in defining the physical world.

tmilard 2 days ago | parent | next [-]

I do agree with you. We have a natural-eye automaton (part of what you call a "biological brain") that unconsciously "feels" the geometric structure of the places we enter.

Once this layer of "natural eye automaton" is programmed behind a camera, it will spit out this crude geometry: the Spatial Data Bulk (SDB). This SDB is small data.

From then on, our programs will reason not on data from camera(s) but only on this small SDB.

This is how I see it.

tmilard 2 days ago | parent [-]

==> And now the LLMs, to gain spatial knowledge, will have a very reduced dataset. This will make spatial reasoning far less compute-intensive than we might imagine.

foobarian 2 days ago | parent | prev | next [-]

Maybe a brute-force solution would work, just like it did for text. I would not be surprised, though, if the scale of brute force required is not yet within reach.

andoando 2 days ago | parent | prev [-]

Also a cook here who's spent years thinking about this; I would love to hear about the results you've obtained.

voxleone 2 days ago | parent | prev | next [-]

I'm trying to approach spatial reasoning by introducing quaternions to navigate graphs. It is a change in the unit of traversal — from positional increment to rotational progression. This reframing has cascading effects. It alters how we model motion, how we think about proximity, and ultimately how systems relate to space itself.

The traditional metaphor of movement — stepping from point A to point B — is spatially intuitive but semantically impoverished. It ignores the continuity of direction, the embodiment of motion, and the nontriviality of turning. Quaternion-based traversal reintroduces these elements. It is not just more precise; it is more faithful to the mechanisms by which physical and virtual entities evolve through space. In other words objects 'become' the model.

https://github.com/VoxleOne/SpinStep/blob/main/docs/index.md
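As a toy illustration of "rotation as the unit of traversal" (a pure-Python sketch, not SpinStep's actual API): orientation is a unit quaternion, and a step multiplies in an incremental rotation rather than adding a positional offset.

```python
# Minimal quaternion toolkit: a traversal step composes an incremental
# rotation into the current orientation instead of incrementing a
# position.

import math

def qmul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def axis_angle(axis, theta):
    """Unit quaternion rotating by `theta` radians about `axis`."""
    x, y, z = axis
    n = math.sqrt(x*x + y*y + z*z)
    s = math.sin(theta / 2) / n
    return (math.cos(theta / 2), x*s, y*s, z*s)

def rotate(q, v):
    """Rotate vector v by quaternion q, via q * v * q^-1."""
    w, x, y, z = q
    p = qmul(qmul(q, (0.0, *v)), (w, -x, -y, -z))
    return p[1:]
```

Two 45-degree steps compose into one 90-degree step, which is the "rotational progression" notion in miniature.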

niemandhier 3 days ago | parent | prev | next [-]

Regarding sparse, nonlinear systems and our ability to learn them:

There is hope. Experimental observation is that, in most cases, the coupled high-dimensional dynamics almost collapses to low-dimensional attractors.

The interesting thing about these is: If we apply a measurement function to their state and afterwards reconstruct a representation of their dynamics from the measurement by embedding, we get a faithful representation of the dynamics with respect to certain invariants.

Even better, suitable measurement functions are dense in function space so we can pick one at random and get a suitable one with probability one.

What can be gleaned about the dynamics in terms of these invariants can be learned for certain; experience shows that we can usually also predict quite well.

There is a chain of embedding theorems by Takens and Sauer gradually broadening the scope of applicability from deterministic chaos towards stochasticly driven deterministic chaos.

Note embedding here is not what current computer science means by the word.

I spent most of my early adulthood doing these things; it would be cool to see them used once more.
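A minimal sketch of the delay embedding described above (embedding in the Takens/Sauer sense, not the ML sense): a scalar measurement series is unfolded into vectors of lagged copies, reconstructing a proxy for the attractor's geometry.

```python
# Delay-coordinate embedding: from one measurement channel, build
# vectors (x[t], x[t+lag], ..., x[t+(dim-1)*lag]). For a suitable lag,
# these vectors trace out a faithful image of the attractor.

import math

def delay_embed(series, dim, lag):
    """Return the list of `dim`-dimensional delay vectors of `series`."""
    n = len(series) - (dim - 1) * lag
    return [tuple(series[t + k * lag] for k in range(dim))
            for t in range(n)]

# A sine observed scalar-by-scalar unfolds back into a loop in delay
# coordinates when the lag is near a quarter period.
xs = [math.sin(0.1 * t) for t in range(200)]
vectors = delay_embed(xs, dim=2, lag=16)
```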

golol 2 days ago | parent [-]

What field of mathematics is this? Can you point me to some keywords/articles?

amelius 2 days ago | parent | prev | next [-]

Some types of deep learning model can handle 3d data quite well:

https://en.wikipedia.org/wiki/Neural_radiance_field

idiotsecant 3 days ago | parent | prev | next [-]

If there's one thing that control theory has taught us in the last 100 years, it's that anything is linear if you zoom in far enough. Nonlinearity is practically solvable by adjusting your controls to different linear models depending on your position in the system space.
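The "zoom in until it's linear" idea can be sketched as numerically linearizing a nonlinear dynamics function around an operating point, which is the first step of gain scheduling (function names are illustrative):

```python
# Finite-difference linearization: approximate a nonlinear system
# f: R^n -> R^n by its Jacobian at an operating point x0. A gain
# scheduler would compute such local linear models at many points
# across the state space and switch controllers between them.

import math

def linearize(f, x0, eps=1e-6):
    """Return (Jacobian rows, f(x0)) for f at x0 by forward
    differences."""
    fx0 = f(x0)
    jac = []
    for i in range(len(fx0)):
        row = []
        for j in range(len(x0)):
            xp = list(x0)
            xp[j] += eps
            row.append((f(xp)[i] - fx0[i]) / eps)
        jac.append(row)
    return jac, fx0

# Example: pendulum dynamics, nonlinear in the angle theta.
def pendulum(state):
    theta, omega = state
    return [omega, -9.81 * math.sin(theta)]
```

Near the downward equilibrium this recovers the familiar linear model with Jacobian [[0, 1], [-g, 0]].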

gyomu 3 days ago | parent | prev | next [-]

> There are a bunch of fundamental computer science problems that stand in the way, which I was schooled on in 2006 from the brightest minds in the field. For example, how do you represent arbitrary spatial relationships on computers in a general and scalable way? There are no solutions in the public data structures and algorithms literature. We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.

Where can I read more about this space? (particularly on the "we know that universal solutions can't exist" front)

ryeguy_24 2 days ago | parent | prev | next [-]

Agree. Also, with respect to training, what is the goal that we are maximizing? LLMs are easy: predict the next word, and we have lots of training data. But what are we training for in the real world? Modeling the next spatial photograph to predict what will happen next? It's not intuitive to me what that objective function would be for spatial intelligence.

kadushka 2 days ago | parent | next [-]

Why wouldn’t predicting the next frame in a video stream be as effective as predicting the next word?

curiouscavalier 2 days ago | parent | prev [-]

Or that there is a sufficiently generalizable objective function for all “spatial intelligence.”

epr 2 days ago | parent | prev | next [-]

Human beings get by quite well with extremely oversimplified (low resolution) abstractions. There is no need whatsoever for something even approaching universal or perfect. Humans aren't thinking about fundamental particles or solving differential equations in their head when they're driving a car or playing sports.

mindcrime 3 days ago | parent | prev | next [-]

> became highly developed over two decades but is not in the public literature.

Developed by who? And for what purpose? Are we talking about overlap with stuff like missile guidance systems or targeting control systems or something, and kept confidential by the military-industrial complex? I'm having a hard time seeing many other scenarios that would explain a large body of people doing research in this area and then not publishing anything.

> I can’t identify anyone on the team that is an expert in a relevant area of computer science theory

Who is an expert on this theory then?

CamperBob2 2 days ago | parent | prev | next [-]

> We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch.

Then again, not much that we "knew" a decade ago is still relevant today. Of course transformer networks have proven capable of representing spatial intelligence. How could they work with 2D images, if not?

queuebert 2 days ago | parent | prev | next [-]

I think you are using "sparse" and "non-linear" as scare terms. Sparse is a good thing, as it reduces degrees of freedom, and non-linear does not mean unsolvable.

Also "impedance mismatch" doesn't mean no go, but rather less efficient.

vrighter a day ago | parent | prev | next [-]

ah but neural networks are universal function approximators! (proceeds to ignore the size of network needed for an adequate approximation and/or how much data would be required to train it)

nurettin 2 days ago | parent | prev | next [-]

> how do you represent arbitrary spatial relationships on computers in a general and scalable way?

Isn't this essentially what the convolutional layers do in LeNet?

andoando 2 days ago | parent | prev | next [-]

What's non-linear about spatial reasoning?

> We know that universal solutions can’t exist

Why not?

randcraw 2 days ago | parent [-]

Spatial models must be 3D, not 1D (linear), much less 2D, which is sufficient for images and object recognition (where models are not needed). And adding time makes it 4D, at least for robot motion.

To reason spatially (and dynamically) the dependence of one object's position in space on other objects (and their motions and behaviors) adds up fast to complicate the model in ways that 95% of 2D static image analysis does not.

andoando 2 days ago | parent [-]

Well hold on. First, I'm not convinced we have solved 2D spatial intelligence. Analyzing 2D images is very different from being able to reason about 2D geometry. How do you mathematically define relations like "above", "below", "diagonal", etc. in a composable way that can be learned?

Second, problems in 3D can be deconstructed into 2D. For example, how do you get to the airport? You first need to solve the 2D overview of the path you'd take, as if you were looking at a map. Then you need to reason about your field of view, and here again I believe you're really reasoning something like "object A is behind object B, and A is to the left of B", not solving some non-linear equation.

I think a big issue is that people are trying to solve this in the realm of traditional mathematics, and not as a simple step-by-step process.
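One concrete (and admittedly simplistic) way to make relations like "above" composable is to define them as predicates on bounding boxes, which then compose by ordinary logic. This is a hypothetical sketch; real qualitative spatial calculi (e.g. RCC8) are much richer:

```python
# Qualitative spatial relations as predicates on axis-aligned boxes
# (xmin, ymin, xmax, ymax), with y increasing upward. Composite
# relations like "diagonal" are built from the primitives with plain
# boolean logic.

def above(a, b):
    """Box a lies entirely above box b."""
    return a[1] >= b[3]

def left_of(a, b):
    """Box a lies entirely to the left of box b."""
    return a[2] <= b[0]

def diagonal(a, b):
    """Boxes are separated on both axes at once."""
    return (above(a, b) or above(b, a)) and (left_of(a, b) or left_of(b, a))
```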

doctorpangloss 2 days ago | parent | prev | next [-]

pretty esoteric when it's so simple: you either think

Bill Peebles is right, naturalistic, physical laws can be learned in deep neural nets from videos.

OR

Fei-Fei Li is right, you need 3D point cloud videos.

Okay, if you think Bill Peebles is right, then all this stuff you are talking about doesn't matter anymore. Lots of great reasons Bill Peebles is probably right, biggest reason of all is that Veo, Sora etc. have really good physics understanding.

If you think Fei-Fei Li is right, you are going to be augmenting real world data sets with game engine content. You can exactly create whatever data you need, for whatever constraints, to train performantly. I don't think this data scalability concern is real.

A compelling reason you are wrong and Fei-Fei Li's specific bet on scalability is right is the existence of Waymo and Zoox. There are also NEW autonomous vehicle companies achieving things faster than Zoox and Waymo did, because a lot of spatial intelligence problems are actually regulatory/political, not scientific.

adamnemecek 3 days ago | parent | prev | next [-]

All (ALL!!) AI/optimization problems boil down to energy minimization or dually entropy maximization.

