godelski a day ago

  > If 'understand' is a meaningless term to someone who's spent 30 years in AI research, I understand why LLMs are being sold and hyped in the way they are.
I don't have quite as much time in the field as robotresearcher, but I've heard their sentiment frequently.

I've been to conferences and talked with people at the top of the field (I'm "junior", but published and with a PhD), and when I ask deeper questions I frequently get the response "I just care if it works." As if that weren't the motivation for my questions too.

But I'll also tell you that there are plenty of us who don't subscribe to those beliefs. There's a wide breadth of opinions, even if one set is large and loud. (We are getting louder though.) I do think we can get to AGI, and I do think we can figure out what words like "understand" truly mean (with both accuracy and precision, the latter being what's more lacking). But it is also hard to navigate because we're discouraged from this work and little funding flows our way (I hope that as we get louder we'll be able to explore more, but I fear we may just switch from one railroad to the next). The weirdest part to me has been that even in the research space, talking to peers, discussing flaws or limits is treated as dismissal. I thought our whole job was to find the limits, explore them, and find ways to resolve them.

The way I see it now is that the field uses the duck test: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. The problem is that people are replacing "probably" with "is". The duck test is great, and right now we don't have anything much better. But the part that is insane is to call it perfect. Certainly, as someone who isn't an ornithologist, I'm not going to be able to tell a sophisticated artificial duck from a real one. But its ability to fool me doesn't make it real. And that's exactly why it would be foolish to s/probably/is.

So while I think you're understanding correctly, I just want to caution against throwing the baby out with the bathwater. The majority of us dissenting from the hype train and "scale is all you need" don't believe humans are magic and operating outside the laws of physics. Unless that is a false assumption, artificial life is certainly possible. The question is just about when and how. I think we still have a ways to go. I think we should be exploring a wide breadth of ideas. I just don't think we should put all our eggs in one basket, especially if there are clear holes in it.

[Side note]: An interesting relationship I've noticed is that the hype train people tend to have a full CS pedigree while dissenters have mixed (and typically start in something like math or physics and make their way to CS). It's a weak correlation, but I've found it interesting.

hodgehog11 a day ago | parent | next [-]

As a mathematician who also regularly publishes in these conferences, I am a little surprised to hear your take; your experience might be slightly different to mine.

Identifying limitations of LLMs in the context of "it's not AGI yet because X" is huge right now; it gets massive funding, taking away from other things like SciML and uncertainty analyses. I will agree that deep learning theory, in the sense of foundational mathematical theory to develop internal understanding (with limited appeal to numerics), is in the roughest state it has ever been in. My first impression there is that the toolbox has essentially run dry and we need something more to advance the field. My second impression is that empirical researchers in LLMs are mostly junior and significantly less critical of their own work and the work of others, but I digress.

I also disagree that we are disincentivised to find meaning behind the word "understanding" in the context of neural networks: if understanding is to build an internal world model, then quite a bit of work is going into that. Empirically, it would appear that they do, almost by necessity.

godelski a day ago | parent [-]

Maybe given our different niches we interact with different people? But I'm uncertain, because I believe what I'm saying is highly visible. I forget, at which NeurIPS(?) was it that so many people were wearing "Scale is all you need" shirts?

  > My first impression there is that the toolbox has essentially run dry and we need something more to advance the field
This is my impression too. Empirical evidence is a great tool and useful, especially when there is no strong theory to provide direction, but it is limited.

  > My second impression is that empirical researchers in LLMs are mostly junior and significantly less critical of their own work and the work of others
But this is not my impression. I see this from many prominent researchers. Maybe they claim SIAYN in jest, but then they should come out and say so instead of doubling down. If we take them at their word (and I do), robotresearcher is not a junior (please, read their comments; they are illustrative of my experience. I'm just arguing back far more than I would in person). I've also been in audiences where people ask questions like mine ("are benchmarks sufficient to make such claims?") and get the response "we just care that it works." Again, I think this is a non-answer to the question. But it being taken as a sufficient answer, especially in response to peers, is unacceptable. It almost always has no follow-up.

I also do not believe these people are less critical. I've had several works that struggled through publication because my models, at a hundredth the size (and a millionth the data), could perform on par or even better. At face value, asks for "more datasets" and "more scale" are reasonable, yet they form a self-reinforcing paradigm that slows progress. It's like a corn farmer smugly asking why the neighboring soybean farmer doesn't grow anything while the corn farmer is chopping all the soybean stems in their infancy. It is a fine ask to big labs with big money, but it is just gatekeeping and lazy evaluation to anyone else. Even at CVPR this last year they passed out "GPU Rich" and "GPU Poor" hats, so I thought the situation was well known.

  > if understanding is to build an internal world model, then quite a bit of work is going into that. Empirically, it would appear that they do, almost by necessity.
I agree a lot of work is going into it, but I also think the approaches are narrow and still benchmark chasing. I was also given the aforementioned responses at workshops on world modeling (a few presenters gave very different and more complex answers, or "it's the best we've got right now", but none of them seemed too confident in claiming "world model" either).

But I'm a bit surprised that, as a mathematician, you think these systems create world models. While I see some generalization, it is impossible for me to distinguish from memorization. We're processing more data than can be scrutinized, and we frequently uncover major limitations in our de-duplication processes[0]. We are definitely abusing the terms "Out of Distribution" and "Zero shot". I don't know how anyone working with a proprietary LLM (or large model) that they don't own can make a claim of "zero-shot" or even "few-shot" capabilities. We're publishing papers left and right, yet it's absurd to claim {zero,few}-shot when we don't have access to the learning distribution. We've conflated these terms with biased sampling. Was the data not in training, or is it just a low-likelihood region of the model? They're indistinguishable without access to the original distribution.
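
To illustrate what I mean by indistinguishable: even a crude contamination check presupposes access to the training corpus. Something like the sketch below (the names are placeholders, not any particular tool or dataset) simply cannot be run against a proprietary model:

  # Crude n-gram overlap check for benchmark contamination. The point is
  # that even this requires the training corpus, which is exactly what we
  # lack for proprietary models. `training_corpus` and `benchmark_items`
  # are illustrative placeholders.

  def ngrams(text: str, n: int = 8) -> set:
      tokens = text.lower().split()
      return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

  def contamination_rate(benchmark_items, training_corpus, n=8, threshold=0.5):
      # Fraction of benchmark items that share many n-grams with training data.
      train_ngrams = set()
      for doc in training_corpus:
          train_ngrams |= ngrams(doc, n)
      flagged = 0
      for item in benchmark_items:
          item_ngrams = ngrams(item, n)
          if item_ngrams and len(item_ngrams & train_ngrams) / len(item_ngrams) >= threshold:
              flagged += 1
      return flagged / max(len(benchmark_items), 1)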

Idk, I think our scaling is just making the problem harder to evaluate. I don't want to stop that camp because they are clearly producing things of value, but I do want that camp to not make claims beyond their evidence. It just makes the discussion more convoluted. I mean, the argument would be different if we were discussing small and closed worlds, but we're not. The claims are that we've created world models, yet many of them are not self-consistent. Certainly that is a requirement. I admit we're making progress, but the claims were made years ago. Take GameNGen[1] or Diamond Diffusion[2]. Neither was the first and neither is self-consistent. Though both are also impressive.

[0] as an example: https://arxiv.org/abs/2303.09540

[1] https://news.ycombinator.com/item?id=41375548

[2] https://news.ycombinator.com/item?id=41826402

hodgehog11 a day ago | parent [-]

Apologies if I ramble a bit here, this was typed in a bit of a hurry. Hopefully I answer some of your points.

First, regarding robotresearcher and simondota's comments, I am largely in agreement with what they say here. The "toaster" argument is a variant of the Chinese Room argument, and there is a standard rebuttal here. The toaster does not act independently of the human so it is not a closed system. The system as a whole, which includes the human, does understand toast. To me, this is different from the other examples you mention because the machine was not given a list of explicit instructions. (I'm no philosopher though so others can do a better job of explaining this). I don't feel that this is an argument for why LLMs "understand", but rather why the concept of "understanding" is irrelevant without an appropriate definition and context. Since we can't even agree on what constitutes understanding, it isn't productive to frame things in those terms. I guess that's where my maths background comes in, as I dislike the ambiguity of it all.

My "mostly junior" comment is partially in jest, but mostly comes from the fact that LLM and diffusion model research is a popular stream for moving into big tech. There are plenty of senior people in these fields too, but many reviewers in those fields are junior.

> I've also seen members of audiences to talks where people ask questions like mine ("are benchmarks sufficient to make such claims?") with responses of "we just care that it works."

This is a tremendous pain point for me, more than I can convey here, but it's not unusual in computer science. Bad researchers will live and die on standard benchmarks. By the way, if you try to focus on another metric under the argument that the benchmarks are not wholly representative of a particular task, expect to get roasted by reviewers. Everyone knows it is easier to just do benchmark chasing.

> I also do not believe these people are less critical.

I think the fact that the "we just care that it works" argument is enough to get published is a good demonstration of what I'm talking about. If "more datasets" and "more scale" are the major criticisms you are getting, then you are still working in a more fortunate field. And yes, I hate it as much as you do, as it does favor the GPU-rich, but those criticisms are at least potentially addressable. The easiest papers of mine to get through were methodological and often got these kinds of comments. Theory and SciML papers are an entirely different beast in my experience, because you will rarely get reviewers who understand the material or care about its relevance. People in LLM research thought the average NeurIPS score in the last round was a 5; those in theory thought it was a 4. These proportions feel reflected in the recent conferences: I have to really go looking for something outside the LLM mainstream, while there was a huge variety of work only a few years ago. Some of my colleagues have noticed this as well and have switched out of scientific work. This isn't unnatural or something to actively try to fix, as ML goes through these hype phases (in the 2000s, it was all kernels, as I understand it).

> approaches are narrow and still benchmark chasing

> as a mathematician you think these systems create world models

When I say "world model", I'm not talking about outputs or what you can get through pure inference. Training models to perform next-frame prediction and looking at inconsistencies in the output tells us little about the internal mechanism. I'm talking about appropriate representations in a multimodal model. When it reads a given frame, is it pulling apart features in a way that a human would? We've known for a long time that embeddings appropriately encode relationships between words and phrases. This is a model of the world as expressed through language. The same thing happens for images at scale, as can be seen in interpretable ViT models. We know from the theory that for next-frame prediction, better data and more scaling improve performance. I agree that isn't very interesting though.
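
As a rough illustration of the kind of relational structure I mean (the `embed` lookup and `vocab` below are placeholders for whatever a given model exposes, not a specific library):

  # Sketch of the classic word-analogy observation: king - man + woman
  # lands near queen. `embed` maps a word to its embedding vector.
  import numpy as np

  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

  def analogy(embed, a, b, c, vocab):
      # Returns the vocab word whose embedding is closest to b - a + c.
      target = embed(b) - embed(a) + embed(c)
      candidates = [w for w in vocab if w not in (a, b, c)]
      return max(candidates, key=lambda w: cosine(embed(w), target))

  # e.g. analogy(embed, "man", "king", "woman", vocab) often returns "queen"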

> We are definitely abusing the terms "Out of Distribution" and "Zero shot".

Absolutely in agreement with everything you have said. These are not concepts that should be talked about in the context of "understanding", especially at scale.

> I think our scaling is just making the problem harder to evaluate.

Yes and no. It's clear that whatever approach we will use to gauge internal understanding needs to work at scale. Some methods only work with sufficient scale. But we know that completely black-box approaches don't work, because if they did, we could use them on humans and other animals.

> The claims are we've created world models yet many of them are not self-consistent.

For this definition of world model, I see this the same way as how we used to have "language models" with poor memory. I conjecture this is more an issue of alignment than a lack of appropriate representations of internal features, but I could be totally wrong on this.

godelski a day ago | parent [-]

  > The toaster does not act independently of the human so it is not a closed system
I think you're mistaken. No, not about that; I think everyone agrees there. You're mistaken at the premise: when I log in to Claude, it says "How can I help you today?"

No one is thinking that the toaster understands things. We're using it to point out how silly the claim "task performance == understanding" is. Techblueberry furthered this by asking if the toaster suddenly becomes intelligent when you wrap it with a cron job. My point was about where the line is drawn. Is it the turning on of the toaster? No, that would be silly, and you clearly agree. So you have to answer why the toaster isn't understanding toast. That's the ask. Because clearly the toaster toasts bread.

You and robotresearcher have still avoided answering this question. It seems dumb, but that is the crux of the problem. The LLM is claimed to be understanding, right? It meets your claims of task performance. But they are still tools. They cannot act independently. I still have to prompt them. At an abstract level this is no different from the toaster. So, at what point does the toaster understand how to toast? You claim it doesn't, and I agree. You claim it doesn't because a human has to interact with it. I'm just saying that looping agents onto themselves doesn't magically make them intelligent, just like how I can automate the whole process from planting the wheat to toasting the toast.

You're a mathematician. All I'm asking is that you abstract this out a bit and follow the logic. Clearly even our automated seed-to-buttered-toast-on-a-plate machine need not have understanding.

From my physics (and engineering) background there's a key thing I've learned: all measurements are proxies. This is no different. We don't have to worry about this detail in most everyday things because we're typically pretty good at measuring, but if you ever need to do something with precision, it becomes abundantly obvious. You even use this same methodology in math all the time, though I wouldn't say it is equivalent to taking a hard problem, creating an isomorphic map to an easier problem, solving it, then mapping back. There's an injective nature to it.

A ruler doesn't measure distance; a ruler is a reference to distance. A laser range finder doesn't measure distance either; it is a photodetector and a timer. There is nothing in the world that you can measure directly. If we cannot do this with physical things, it seems pretty silly to think we can do it with abstract concepts that we can't create robust definitions for. It's not like we've directly measured the Higgs either.

So do you think entropy is actually a measurement of intelligible speech? Is perplexity a good tool for identifying an entropy minimizer, or does it just correlate? Is FID a measurement of fidelity, or are we just using a useful proxy? I'm sorry, but I just don't think there are precise mathematical descriptions of things like natural English language or realistic human faces. I've developed some of the best vision models out there, and I can tell you that you have to read more than the paper, because while they will produce fantastic images, they also produce some pretty horrendous ones. The fact that they statistically generate realistic images does not imply that they actually understand them.
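
To put the proxy point concretely, here's a minimal sketch with nothing model-specific assumed. Perplexity is just the exponential of the mean negative log-likelihood the model assigns to the observed tokens; it measures fit to the model's own distribution, and any link to "intelligible speech" is a correlation we hope holds:

  # Perplexity as a proxy: exp of mean negative log-likelihood over tokens.
  import math

  def perplexity(token_probs: list[float]) -> float:
      # token_probs: probability the model assigned to each observed token.
      nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
      return math.exp(nll)

  # Geometric mean of the inverse probabilities:
  print(perplexity([0.5, 0.25, 0.125]))  # 4.0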

  > I'm no philosopher
Why not? It sounds like you are. Do you not think about metamathematics? What math means? Do you not think about math beyond the computation? If you do, I'd call you a philosopher. There's a P in a PhD for a reason. We're not supposed to be automata. We're not supposed to be machine men, with machine minds, and machine hearts.

  > This is a tremendous pain point ... researchers will live and die on standard benchmarks.
It is a pain we share. I see it outside CS as well, but I was shocked to see the difference. Most of the other physicists and mathematicians I know that came over to CS were also surprised. And it isn't like physicists are known for their lack of egos lol

  > then you are still working in a more fortunate field
Oh, I've gotten the other comments too. That research never found publication, and at the end of the day I had to graduate. Though now it can be revisited. I was once surprised to find that I had saved a paper from Max Welling's group: my fellow reviewers were confident in their rejections, but since they admitted to not understanding differential equations the AC sided with me (maybe they could see Welling's name? I didn't know till months after). It barely got through a workshop, but should have been in the main proceedings.

So I guess I'm saying I share this frustration. It's part of the reason I talk strongly here. I understand why people shift gears. But I think there's a big difference between begrudgingly getting on the train because you need to publish to survive, and actively fueling it while shouting that all other trains are broken and can never be fixed. One train to rule them all? I guess CS people love their binaries.

  > world model
I agree that looking at outputs tells us little about their internal mechanisms. But proof isn't symmetric in difficulty either. A world model has to be consistent. I like vision because it gives us more clues in our evaluations; it lets us evaluate beyond metrics. If we are seeing video from a POV perspective, then if we see a wall in front of us, turn left, then turn back, we should still expect to see that wall, and the same one. A world model is a model beyond what is seen from the camera's view. A world model is a physics model. And I mean /a/ physics model, not "physics"; there is no single physics model. Nor do I mean that a world model needs to have accurate physics. But it does need to make consistent and counterfactual predictions. Even the geocentric model is a world model (literally a model of worlds lol). The model of the world you have in your head is this. We don't close our eyes and conclude the wall in front of us will disappear. Someone may spin you around and you still won't do this, even if you have your coordinates wrong. The issue isn't so much memory as it is understanding that walls don't just appear and disappear. It is also understanding that this isn't always true about a cat.

I referenced the game engines because while they are impressive they are not self consistent. Walls will disappear. An enemy shooting at you will disappear sometimes if you just stop looking at it. The world doesn't disappear when I close my eyes. A tree falling in a forest still creates acoustic vibrations in the air even if there is no one to hear it.

A world model is exactly that, a model of a world. It is a superset of a model of a camera view. It is a model of the things in the world and how they interact together, regardless of if they are visible or not. Accuracy isn't actually the defining feature here, though it is a strong hint, at least it is for poor world models.
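
Here's the kind of probe I have in mind, as a rough sketch under an assumed interface; nothing here corresponds to a real system's API:

  # Sketch of the look-away-and-back probe. `model.step(frame, action)`
  # returning the next generated frame is a placeholder interface, not any
  # real neural game engine's API.
  import numpy as np

  def look_away_and_back(model, frame: np.ndarray) -> float:
      # Turn left, then turn back; a self-consistent world model should
      # return a view close to the original (the wall should still be there).
      away = model.step(frame, action="turn_left")
      back = model.step(away, action="turn_right")
      return float(np.abs(back.astype(float) - frame.astype(float)).mean())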

I know this last part is a bit more rambly and harder to convey. But I hope the intention came across.

robotresearcher 13 hours ago | parent [-]

> You and robotresearcher have still avoided answering this question.

I have repeatedly explicitly denied the meaningfulness of the question. Understanding is a property ascribed by an observer, not possessed by a system.

You may not agree, but you can’t maintain that I’m avoiding that question. It does not have an answer that matters; that is my specific claim.

You can say a toaster understands toasting or you can not. There is literally nothing at stake there.

godelski 12 hours ago | parent [-]

You said the LLMs are intelligent because they do tasks. But the claim is inconsistent with the toaster example.

If a toaster isn't intelligent because I have to give it bread and press the button to start, then how is that any different from giving an LLM a prompt and pressing the button to start?

It's never been about the toaster. You're avoiding answering the question. I don't believe you're dumb, so don't act the part. I'm not buying it.

robotresearcher 11 hours ago | parent [-]

I didn’t describe anything as intelligent or not intelligent.

I’ll bow out now. Not fun to be ascribed views I don’t have, despite trying to be as clear as I can.

robotresearcher a day ago | parent | prev [-]

Intellectual caution is a good default.

Having said that, can you name one functional difference between an AI that understands, and one that merely behaves correctly in its domain of expertise?

As an example, how would a chess program that understands chess differ from one that is merely better at it than any human who ever lived?

(Chess the formal game; not chess the cultural phenomenon)

Some people don’t find the example satisfying, because they feel like chess is not the kind of thing where understanding pertains.

I extend that feeling to more things.

godelski a day ago | parent [-]

  > any human who ever lived
Is this falsifiable? Even restricting to those currently living? On what tests? In which way? Does the category of error matter?

  > can you name one functional difference between an AI that understands, and one that merely behaves correctly in its domain of expertise?
I'd argue you didn't understand the examples from my previous comment or the direct reply[0]. Does it become a duck as soon as you are able to trick an ornithologist? All ornithologists?

But yes. Is it fair if I use Go instead of Chess? Game 4 with Lee Sedol seems an appropriate example.

Vafa also has some good examples[1,2].

But let's take an even more theoretical approach. Chess is technically a solvable game: it is finite, deterministic, and perfect-information, so an optimal strategy could in principle be computed from any valid state. The problem is that this is intractable, since the number of state-action pairs is so large. But the number of moves isn't the critical part here, so let's look at Tic-Tac-Toe. We can pretty easily program up a machine that will not lose. We can put all actions and states into a graph and fit that on a computer, no problem. Would you really say that the program understands Tic-Tac-Toe better than a human? I'm not sure we should even say it understands the game at all.
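
To make that concrete, here's a minimal sketch of such a machine: plain minimax over the full game tree, nothing clever. It will never lose, yet I'd hesitate to say it understands the game:

  # Minimal minimax sketch for Tic-Tac-Toe: exhaustive search of the game
  # tree yields a player that never loses. Whether that counts as the
  # program "understanding" the game is exactly the question at issue.

  LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

  def winner(board):
      # board: list of 9 cells, each 'X', 'O', or ' '
      for a, b, c in LINES:
          if board[a] != ' ' and board[a] == board[b] == board[c]:
              return board[a]
      return None

  def minimax(board, player):
      # Returns (score, move) from `player`'s perspective: +1 win, 0 draw, -1 loss.
      w = winner(board)
      if w is not None:
          return (1 if w == player else -1), None
      moves = [i for i, cell in enumerate(board) if cell == ' ']
      if not moves:
          return 0, None  # draw
      opponent = 'O' if player == 'X' else 'X'
      best_score, best_move = -2, None
      for m in moves:
          board[m] = player
          score, _ = minimax(board, opponent)  # score from opponent's view
          board[m] = ' '
          if -score > best_score:
              best_score, best_move = -score, m
      return best_score, best_move

  # _, move = minimax([' '] * 9, 'X')  # optimal opening move for X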

I don't think the situation is resolved by changing to unsolved (or effectively unsolved) games. That's the point of the Heliocentric/Geocentric example. The Geocentric Model gave many accurate predictions, but I would find it surprising if you suggested an astronomer at that time, with deep expertise in the subject, understood the configuration of the solar system better than a modern child who understands Heliocentricism. Their model makes accurate predictions and certainly more accurate than that child would, but their model is wrong. It took quite a long time for Heliocentrism to not just be proven to be correct, but to also make better predictions than Geocentrism in all situations.

So I see 2 critical problems here.

1) The more accurate model[3] can be less developed, resulting in lower predictive capabilities despite being a much more accurate representation of the verifiable environment. Accuracy and precision are different, right?

2) Test performance says nothing about coverage/generalization[4]. We can't prove our code is error free through test cases. We use them to bound our confidence (a very useful feature! I'm not against tests, but as you say, caution is good).

In [0] I referenced Dyson; I'd appreciate it if you watched that short video (again, if it's been some time). How do you know you aren't making the same mistake Dyson almost did, the mistake he would have made had he not trusted Fermi? Remember, Fermi's predictions were accurate, and they even stood for years.

If your answer is time, then I'm not convinced it is a sufficient explanation. It doesn't explain Fermi's "intuition" (understanding) and is just kicking the can down the road. You wouldn't be able to differentiate yourself from Dyson's mistake. So why not take caution?

And to be clear, you are the one making the stronger claim: "understanding has a well defined definition." My claim is that yours is insufficient. I'm not claiming I have an accurate and precise definition, my claim is that we need more work to get the precision. I believe your claim can be a useful abstraction (and certainly has been!), but that there are more than enough problems that we shouldn't hold to it so tightly. To use it as "proof" is naive. It is equivalent to claiming your code is error free because it passes all test cases.

[0] https://news.ycombinator.com/item?id=45622156

[1] https://arxiv.org/abs/2406.03689

[2] https://arxiv.org/abs/2507.06952

[3] Certainly placing the Earth at the center of the solar system (or universe!) is a larger error than placing the sun at the center of the solar system and failing to predict the tides or retrograde motion of Mercury.

[4] This gets exceedingly complex as we start to differentiate from memorization. I'm not sure we need to dive into what the distance from some training data needs be to make it a reasonable piece of test data, but that is a question that can't be ignored forever.

robotresearcher 14 hours ago | parent [-]

  >> any human who ever lived
> Is this falsifiable? Even restricting to those currently living? On what tests? In which way? Does the category of error matter?

Software reliably beats the best players that have ever played it in public, including Kasparov and Carlsen, the best players of my lifetime (to my limited knowledge). By analogy to the performance ratchet we see in the rest of sports and games, we might reasonably assume that these dominant living players are the best the world has ever seen. That could be wrong. But my argument does not hang on this point, so asking about falsifiability here doesn't do any work. Of course it's not falsifiable.

Y'know what else is not falsifiable? "That AI doesn't understand what it's doing".

  > can you name one functional difference between an AI that understands, and one that merely behaves correctly in its domain of expertise?
> I'd argue you didn't understand the examples from my previous comment or the direct reply[0]. Does it become a duck as soon as you are able to trick an ornithologist? All ornithologists?

No one seems to have changed their opinion about anything in the wake of AIs routinely passing the Turing Test. People are fooled by the chatbot passing as a human, and then ask about ducks instead. The most celebrated and seriously considered quacks-like-a-duck argument has been won by the AIs, and no one cares.

By the way, the ornithologists' criteria for a duck are probably genetic and have little to do with behavior. A dead duck is still a duck.

And because we know what a duck is, no-one is yelling at ducks that 'they don't really duck' and telling duck makers they need a revolution in duck making and they are doomed to failure if they don't listen.

Not so with 'understanding'.

godelski 12 hours ago | parent [-]

  > Y'know what else is not falsifiable? "That AI doesn't understand what it's doing".
Which is why people are saying we need to put in more work to define this term. Which is the whole point of this conversation.

  > seriously considered quacks like a duck argument has been won by the AIs and no-one cares.
And have you ever considered that it's because people are refining their definitions?

Often, when people find that their initial beliefs are wrong or not precise enough, they update their beliefs. You seem to be calling this a flaw. It's not like the definitions are dramatically changing; they're being refined. There's a big difference.

robotresearcher 11 hours ago | parent [-]

My first post here is me explaining that I have a non-standard definition of what ‘understanding’ means, which helps me avoid an apparently thorny issue. I’m literally here offering a refinement of a definition.

This is a weird conversation.