libraryofbabel 10 days ago

This essay could probably benefit from some engagement with the literature on “interpretability” in LLMs, including the empirical results about how knowledge (like addition) is represented inside the neural network. To be blunt, I’m not sure that being smart and reasoning from first principles after asking the LLM a lot of questions and cherry-picking what it gets wrong gets you to any novel insights at this point. And it already feels a little out of date: with LLMs getting gold on the Mathematical Olympiad, they clearly have a pretty good world model of mathematics. I don’t think cherry-picking a failure to prove 2 + 2 = 4 in the particular specific way the writer wanted to see disproves that at all.

LLMs have imperfect world models, sure. (So do humans.) That’s because they are trained to be generalists and because their internal representations of things are massively compressed, since they don’t have enough weights to encode everything. I don’t think this means there are natural limits to what they can do.

yosefk 10 days ago | parent | next [-]

Your being blunt is actually very kind, if you're describing what I'm doing as "being smart and reasoning from first principles"; and I agree that I am not saying something very novel, at most it's slightly contrarian given the current sentiment.

My goal is not to cherry-pick failures for its own sake as much as to try to explain why I get pretty bad output from LLMs much of the time, which I do. They are also very useful to me at times.

Let's see how my predictions hold up; I have made enough to look very wrong if they don't.

Regarding "failure disproving success": it can't, but it can disprove a theory of how this success is achieved. And, I have much better examples than the 2+2=4, which I am citing as something that sorta works these says

WillPostForFood 8 days ago | parent | next [-]

Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT and got a reasonable answer, nothing like what you paraphrased in your post.

https://imgur.com/a/O9CjiJY

yosefk 3 days ago | parent | next [-]

The examples are from the latest versions of ChatGPT, Claude, Grok, and Google AI Overview. I did not bother to list the full conversations because (A) LLMs are very verbose and (B) nothing ever reproduces, so in any case any failure is "abnormally bad." I guess dismissing failures and focusing on successes is a natural continuation of our industry's trend to ship software with bugs which allegedly don't matter because they're rare, except with "AI" the MTBF is orders of magnitude shorter

typpilol 8 days ago | parent | prev | next [-]

This seems like a common theme with these types of articles

eru 8 days ago | parent [-]

Perhaps the people who get decent answers don't write articles about them?

ehnto 8 days ago | parent [-]

I imagine people give up silently more often than they write a well-syndicated article about it. The actual adoption and efficiencies we see in enterprises will be the most verifiable data on whether LLMs are generally useful in practice. Everything so far is just academic pontificating or anecdata from strangers online.

eru 8 days ago | parent | next [-]

I am inclined to agree.

However, I'm not completely sure. Eg object oriented programming was basically a useless fad full of empty, never-delivered-on promises, but software companies still lapped it up. (If you happen to like OOP, you can probably substitute your own favourite software or wider management fad.)

Another objection: even an LLM with limited capabilities and glaring flaws can still be useful for some commercial use-cases. Eg the job of first-line call centre agents who aren't allowed to deviate from a fixed script can be reasonably automated with even a fairly bad LLM.

Will it suck occasionally? Of course! But so does interacting with the humans placed into these positions without authority to get anything done for you. So if the bad LLM is cheaper, it might be worthwhile.

libraryofbabel 8 days ago | parent | prev [-]

This. I think we’ve about reached the limit of the usefulness of anecdata “hey I asked an LLM this, this, and this” blog posts. We really need more systematic large-scale data and studies on the latest models and tools - the recent one on Cursor (which had mixed results) was a good start, but it was carried out before Claude Code was even released, i.e. prehistoric times in terms of AI coding progress.

For my part I don’t really have a lot of doubts that coding agents can be a useful productivity boost on real-world tasks. Setting aside personal experience, I’ve talked to enough developers at my company using them for a range of tickets on a large codebase to know that they are. The question is more, how much: are we talking a 20% boost, or something larger, and also, what are the specific tasks they’re most useful on. I do hope in the next few years we can get some systematic answers to that as an industry, that go beyond people asking LLMs random things and trying to reason about AI capabilities from first principles.

marcellus23 8 days ago | parent | prev [-]

I think it's hard to take any LLM criticism seriously if they don't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion.

ehnto 8 days ago | parent | next [-]

When talking about the capabilities of a class of tools long term, it makes sense to be general. I think deriving conclusions at all is pretty difficult given how fast everything is moving, but there are some realities we do actually know about how LLMs work, and we can talk about those.

Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.

dpoloncsak 6 days ago | parent [-]

> Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.

Isn't that exactly what is going to help us understand the value these tools bring to end-users, and how to optimize these tools for better future use? None of these models are copy+pastes; they tend to be doing things slightly differently under the hood. How those differences affect results seems like exactly the data we would want here.

ehnto 6 days ago | parent [-]

I guess I disagree that the main concern is the differences per each model, rather than the overall technology of LLMs in general. Given how fast it's all changing, I would rather focus on the broader conversation personally. I don't really care if GPT5 is better at benchmarks, I care that LLMs are actually capable of the type of reasoning and productive output that the world currently thinks they are.

marcellus23 5 days ago | parent [-]

Sure, but if you're making a point about LLMs in general, you need to use examples from best-in-class models. Otherwise your examples of how these models fail are meaningless. It would be like complaining about how smartphone cameras are inherently terrible, but all your examples of bad photos aren't labeled with what phone was used to capture. How can anyone infer anything meaningful from that?

p1esk 8 days ago | parent | prev [-]

Yes, I’d be curious about his experience with the GPT-5 Thinking model. So far I haven’t seen any blunders from it.

eru 8 days ago | parent [-]

I've seen plenty of blunders, but in general it's better than their previous models.

Well, it depends a bit on what you mean by blunders. But eg I've seen it confidently assert mathematically wrong statements with nonsense proofs, instead of admitting that it doesn't know.

grey-area 8 days ago | parent [-]

In a very real sense it doesn’t even know that it doesn’t know.

eru 8 days ago | parent [-]

Maybe. But in math you can either produce the proof (with each step checkable) or you can't.
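
To make "checkable" concrete, here is a tiny Lean 4 sketch (using nothing beyond the core library): the kernel either accepts the proof term or rejects the file, with no confident-but-wrong middle ground:

    -- Lean 4: both sides reduce to the same numeral, so `rfl` closes the goal.
    -- A wrong proof simply fails to compile.
    example : 2 + 2 = 4 := rfl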

libraryofbabel 10 days ago | parent | prev [-]

I mean yeah, it’s a good essay in that it made me think and try to articulate the gaps, and I’m always looking to read things that push back on AI hype. I usually just skip over the hype blogging.

I think my biggest complaint is that the essay points out flaws in LLMs’ world models (totally valid, they do confidently get things wrong and hallucinate in ways that are different from, and often more frustrating than, how humans get things wrong) but then it jumps to claiming that there is some fundamental limitation about LLMs that prevents them from forming workable world models. In particular, it strays a bit towards the “they’re just stochastic parrots” critique, e.g. “that just shows the LLM knows to put the words explaining it after the words asking the question.” That just doesn’t seem to hold up in the face of e.g. LLMs getting gold on the Mathematical Olympiad, which features novel questions. If that isn’t a world model of mathematics - being able to apply learned techniques to challenging new questions - then I don’t know what is.

A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.

I do think there’s a lot that the essay gets right. If I was to recast it, I’d put it something like this:

* LLMs have imperfect models of the world, which are conditioned by how they’re trained on next-token prediction.

* We’ve shown we can drastically improve those world models for particular tasks by reinforcement learning. You kind of allude to this already by talking about how they’ve been “flogged” to be good at math.

* I would claim that there’s no particular reason these RL techniques aren’t extensible in principle to beat all sorts of benchmarks that might look unrealistic now. (Two years ago it would have been an extreme optimist position to say an LLM could get gold on the mathematical Olympiad, and most LLM skeptics would probably have said it could never happen.)

* Of course it’s very expensive, so most world models LLMs have won’t get the RL treatment and so will be full of gaps, especially for things that aren’t amenable to RL. It’s good to beware of this.

I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. (If an LLM was given tons of RL training on that codebase, it could build a better world model, but that’s expensive and very challenging to set up.) This problem is hinted at in your essay, but the lack of on-the-job learning isn’t centered. But it’s the real elephant in the room with LLMs and the one the boosters don’t really have an answer to.

Anyway thanks for writing this and responding!

yosefk 10 days ago | parent | next [-]

I'm not saying that LLMs can't learn about the world - I even mention how they obviously do it, even at the learned embeddings level. I'm saying that they're not compelled by their training objective to learn about the world and in many cases they clearly don't, and I don't see how to characterize the opposite cases in a more useful way than "happy accidents."

I don't really know how they are made "good at math," and I'm not that good at math myself. With code I have a better gut feeling of the limitations. I do think that you could throw them off terribly with unusual math questions to show that what they learned isn't math, but I'm not the guy to do it; my examples are about chess and programming, where I am more qualified to do it. (You could say that my question about the associativity of blending and how caching works sort of shows that it can't use the concept of associativity in novel situations; not sure if this can be called an illustration of its weakness at math.)

calf 8 days ago | parent | prev [-]

But this is parallel to saying LLMs are not "compelled" by the training algorithms to learn symbolic logic.

Which says to me that there are two camps on this, and the jury is still out on it and all related questions.

teleforce 7 days ago | parent [-]

>LLMs are not "compelled" by the training algorithms to learn symbolic logic.

I think "compell" is such a unique human trait that machine will never replicate to the T.

The article did mention specifically about this very issue:

"And of course people can be like that, too - eg much better at the big O notation and complexity analysis in interviews than on the job. But I guarantee you that if you put a gun to their head or offer them a million dollar bonus for getting it right, they will do well enough on the job, too. And with 200 billion thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform."

If it's not already evident: the LLM is in itself, by definition, a limited stochastic AI tool, and its distant cousins are deterministic logic, optimization, and constraint programming [1],[2],[3]. Perhaps one of the two breakthroughs that the author was predicting will be in this deterministic domain, in order to assist the LLM, and it will be a hybrid approach rather than a purely LLM one (a small constraint-programming sketch follows the links below).

[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:

https://www.youtube.com/live/TknN8fCQvRk

[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:

https://youtube.com/watch?v=HB5TrK7A4pI

[3] Google OR-Tools:

https://developers.google.com/optimization

[4] MiniZinc:

https://www.minizinc.org/
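
To make the deterministic cousin concrete, here is a minimal sketch with OR-Tools' CP-SAT solver (the toy constraints are made up purely for illustration): unlike an LLM, the solver either returns an assignment that provably satisfies every constraint or reports that none exists.

    from ortools.sat.python import cp_model  # pip install ortools

    model = cp_model.CpModel()
    x = model.NewIntVar(0, 10, "x")
    y = model.NewIntVar(0, 10, "y")
    model.Add(x + 2 * y == 14)  # toy constraints, nothing from the thread
    model.Add(x > y)

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        # one assignment satisfying both constraints, e.g. x=8, y=3
        print(solver.Value(x), solver.Value(y))
    else:
        print("no solution exists")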

calf 7 days ago | parent [-]

And yet there are two camps on the matter. Experts like Hinton disagree, others agree.

2muchcoffeeman 8 days ago | parent | prev | next [-]

It’s not just on-the-job learning though. I’m no AI expert, but the fact that you have “prompt engineers” and that AI doesn’t know what it doesn’t know gives me pause.

If you ask an expert, they know the bounds of their knowledge and can understand questions asked to them in multiple ways. If they don’t know the answer, they could point to someone who does or just say “we don’t know”.

LLMs just lie to you and we call it “hallucinating” as though they will eventually get it right when the drugs wear off.

eru 8 days ago | parent | next [-]

> I’m no AI expert, but the fact that you have “prompt engineers” [...] gives me pause.

Why? A bunch of human workers can get a lot more done with a capable leader who helps prompt them in the right direction and corrects oversights etc.

And overall, prompt engineering seems like exactly the kind of skill AI will be able to develop by itself. You already have a bit of this happening: when you ask Gemini to create a picture for you, the language part of Gemini will take your request and engineer a prompt for the picture part of Gemini.

intended 8 days ago | parent [-]

This is the goalpost flip that happens in AI conversations (if goalpost is even the right term; maybe conversation switch?).

There are 2 AI conversations on HN occurring simultaneously.

Convo A: Is it actually reasoning? Does it have a world model? Etc.

Convo B: Is it good enough right now? (for X, Y, or Z workflow)

eru 8 days ago | parent [-]

Maybe, yes. It's good to acknowledge that both of these conversations are worthwhile to have.

Mikhail_Edoshin 8 days ago | parent | prev [-]

An LLM comprehends, but does not understand. It is interesting to see these two qualities separated; until now they were synonyms.

eru 8 days ago | parent | prev [-]

> A lot of that success is from reinforcement learning techniques where the LLM is made to solve tons of math problems after the pre-training “read everything” step, which then gives it a chance to update its weights. LLMs aren’t just trained from reading a lot of text anymore. It’s very similar to how the alpha zero chess engine was trained, in fact.

It's closer to AlphaGo, which first trained on expert human games and then 'fine tuned' with self-play.

AlphaZero specifically did not use human training data at all.

I am waiting for an AlphaZero style general AI. ('General' not in the GAI sense but in the ChatGPT sense of something you can throw general problems at and it will give it a good go, but not necessarily at human level, yet.) I just don't want to call it an LLM, because it wouldn't necessarily be trained on language.

What I have in mind is something that first solves lots and lots of problems, eg logic problems, formally posed programming problems, computer games, predicting of next frames in a web cam video, economic time series, whatever, as a sort-of pre-training step and then later perhaps you feed it a relatively small amount of human readable text and speech so you can talk to it.

Just to be clear: this is not meant as a suggestion for how to successfully train an AI. I'm just curious whether it would work at all and how well / how badly.

Presumably there's a reason why all SOTA models go 'predict human produced text first, then learn problem solving afterwards'.

> I think the biggest limitation LLMs actually have, the one that is the biggest barrier to AGI, is that they can’t learn on the job, during inference. This means that with a novel codebase they are never able to build a good model of it, because they can never update their weights. [...]

Yes, I agree. But 'on-the-job' training is also such an obvious idea that plenty of people are working on making it work.

AyyEye 10 days ago | parent | prev | next [-]

With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which only takes a few gates in digital logic) happens to be overfit into a few nodes on multi-billion node networks is hardly a surprise to anyone except the most religious of AI believers.

BobbyJo 10 days ago | parent | next [-]

The core issue there isn't that the LLM isn't building internal models to represent its world, it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.

It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.

In this case, the LLM can't see letters, so asking it to count them causes it to try and draw from some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
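
As a rough illustration (a minimal sketch using OpenAI's tiktoken tokenizer; the exact split varies between models), the model never receives the individual characters, only opaque token IDs:

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("blueberry")
    # The model is fed these integer IDs, not letters.
    print(ids)
    print([enc.decode_single_token_bytes(i) for i in ids])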

I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/

LLMs are able to encode geospatial relationships because they can be represented by token relationships well. Two countries that are close together will be talked about together much more often than two countries far from each other.

vrighter 10 days ago | parent | next [-]

That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.

Your argument is based on a flawed assumption: that they can't see letters. If they couldn't, they wouldn't be able to spell the word out. But they do. And when they do get one token per letter, they still miscount.

xigoi 8 days ago | parent | prev [-]

> It's like asking a blind person to count the number of colors on a car.

I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.

eru 8 days ago | parent | prev | next [-]

> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Train your model on characters instead of on tokens, and this problem goes away. But I don't think this teaches us anything about world models more generally.

yosefk 10 days ago | parent | prev | next [-]

Actually, I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think

astrange 8 days ago | parent [-]

Tokenization makes things harder, but it doesn't make them impossible. Just takes a bit more memorization.

Other writing systems come with "tokenization" built in, making it still a live issue. Think of answering:

1. How many n's are in 日本?

2. How many ん's are in 日本?

(Answers are 2 and 1.)
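
A quick way to see why the answers differ (a toy illustration; you are really counting over two different renderings of the same word):

    # 日本 itself contains neither character; the count depends on the rendering you pick.
    assert "nihon".count("n") == 2    # romanized reading
    assert "にほん".count("ん") == 1   # kana reading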

andyjohnson0 10 days ago | parent | prev | next [-]

> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Is this a real defect, or some historical thing?

I just asked GPT-5:

    How many "B"s in "blueberry"?
and it replied:

    There are 2 — the letter b appears twice in "blueberry".
I also asked it how many Rs in Carrot, and how many Ps in Pineapple, and it answered both questions correctly too.
libraryofbabel 10 days ago | parent | next [-]

It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.

jijijijij 8 days ago | parent | next [-]

The question is, did these LLMs figure it out by themselves, or has someone programmed a specific routine to address this "issue", to make it look smarter than it is?

On a trillion dollar budget, you could just crawl the web for AI tests people came up with and solve them manually. We know it‘s a massively curated game. With that kind of money you can do a lot of things. You could feed every human on earth countless blueberries for starters.

Calling an algorithm to count letters in a word isn’t exactly worth the hype tho is it?

The point is, we tend to find new ways these LLMs can’t figure out the most basic shit about the world. Horses can count. Counting is in everything. If you read every text ever written and still can’t grasp counting you simply are not that smart.

pydry 10 days ago | parent | prev | next [-]

Depends how you define historical. If by historical you mean more than two days ago then, yeah, it's ancient history.

pxc 8 days ago | parent | prev [-]

Some LLMs do better than others, but this still sometimes trips up even "frontier" non-reasoning models. People were showing this on this very forum with GPT-5 in the past couple days.

bgwalter 10 days ago | parent | prev | next [-]

It is not historical:

https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...

Perhaps they have a hot fix that special cases HN complaints?

AyyEye 10 days ago | parent [-]

They clearly RLHF out the embarrassing cases and make cheating on benchmarks into a sport.

Terr_ 8 days ago | parent [-]

I wouldn't be surprised if some models get set up to identify that type of question and run the word through a string-processing function.

jijijijij 8 days ago | parent [-]

Of course they do stuff like that, otherwise it would look like they are stagnating. Fake it till you make it. Tho, at this point, the world is in deep shit, if they don’t make it…

pmg101 7 days ago | parent [-]

What deep shit do you foresee?

My prediction is that this will be like the 2000 dot-com bubble. Both dot-com and AI are real and really useful technologies, but hype and share prices have got way ahead of them, so they will need to readjust.

jijijijij 7 days ago | parent [-]

A major economic crisis, yes. I think the web is already kinda broken because of AI, and it's gonna get a lot worse. I also question its usefulness… Is it useful for solving any real problems, and if so, how long before we run out of those problems? Because we conflated a lot of bullshit with innovation right before AI. Right now people may be getting a slight edge, but it's like getting a dishwasher: once expectations adjust, things will feel like a grind again, and I really don't think people will like that new reality in regard to their experience of self-efficacy (which is important for mental health). I presume the struggle to get information, figuring it out yourself, may be a really important part of what pushes process optimization, learning, and cognitive development. We may collectively regress there. With so many major crises, and a potential economic crisis on top, I am not sure we can afford to lose problem-solving capabilities to any extent. And I really, really don't think AI is worth the fantastical energy expenditure, waste of resources, and human exploitation, so far.

ThrowawayR2 10 days ago | parent | prev | next [-]

It was discussed and reproduced on GPT-5 on HN couple of days ago: https://news.ycombinator.com/item?id=44832908

Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.

nosioptar 10 days ago | parent | prev [-]

Shouldn't the correct answer be that there is not a "B" in "blueberry"?

eru 8 days ago | parent [-]

No, why?

It depends on context. English is often not very precise and relies on implied context clues. And that's good. It makes communication more efficient in general.

To spell it out: in this case I suspect you are talking about English letter case? Most people don't care about case when they ask these questions, especially in an informal question.

Nevermark 7 days ago | parent | prev | next [-]

That was always a specious test.

LLMs don't ingest text a character at a time. The difficulty with analyzing individual letters just reflected that they don't directly "see" letters in their tokenized input.

A direct comparison would be asking someone how many convex Bézier curves are in the spoken word "monopoly".

Or how many red pixels are in a visible icon.

We could work out answers to both. But they won't come to us one-shot or accurately, without specific practice.

libraryofbabel 10 days ago | parent | prev | next [-]

> they clearly don't have any world model whatsoever

Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.

simiones 8 days ago | parent [-]

> where it certainly hadn’t seen the questions before?

What are you basing this certainty on?

And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.

It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.

eru 8 days ago | parent | next [-]

> What are you basing this certainty on?

People make up new questions for each IMO.

fxtentacle 7 days ago | parent [-]

Didn’t OpenAI get caught bribing their way to pre-tournament access of the questions?

eru 7 days ago | parent [-]

This is the first time I hear about this. (It's certainly possible, but I'd need to see some evidence or at least a write-up.)

OpenAI got flamed over announcing their results before the embargo was up:

IMO had asked companies to wait at least a week or so after the human winners were announced to announce the AI results. OpenAI did not wait.

libraryofbabel 8 days ago | parent | prev [-]

Like the other reply said, each exam has entirely new questions which are of course secret until the test is taken.

Sure, the questions were probably in a similar genre as existing questions or required similar techniques that could be found in solutions that are out there. So what? You still need some kind of world model of mathematics in which to understand the new problem and apply the different techniques to solve it.

Are you really claiming that SOTA LLMs don’t have any world model of mathematics at all? If so, can you tell us what sort of example would convince you otherwise? (Note that the ability to do novel mathematics research is setting the bar too high, because many capable mathematics majors never get to that point, and they clearly have a reasonable model of mathematics in their heads.)

williamcotton 8 days ago | parent | prev | next [-]

I don’t solve math problems with my poetry writing skills:

https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27...

derdi 7 days ago | parent | prev [-]

Ask a kid that doesn't know how to read and write how many Bs there are in blueberry.

Ygg2 7 days ago | parent [-]

For a kid that doesn't know how to read or write, ChatGPT writes way too much.

joe_the_user 8 days ago | parent | prev | next [-]

I think both the literature on interpretability and explorations of internal representations actually reinforce the author's conclusion. I think internal-representation research tends to show that nets that deal with a single "model" don't necessarily have the same representation, and don't necessarily have a single representation.

And doing well on XYZ isn't evidence of a world model in particular. The point that these things aren't always using a world model is reinforced by systems being easily confused by extraneous information, even systems as sophisticated as those that can solve Math Olympiad questions. The literature has said "ad-hoc predictors" for a long time, and I don't think much has changed - except things do better on benchmarks.

And, humans too can act without a consistent world model.

lossolo 10 days ago | parent | prev | next [-]

https://arxiv.org/abs/2508.01191

armchairhacker 10 days ago | parent | prev [-]

Any suggestions from this literature?

libraryofbabel 10 days ago | parent [-]

The papers from Anthropic on interpretability are pretty good. They look at how certain concepts are encoded within the LLM.