AyyEye 10 days ago

With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever. That addition (something which takes only a few gates in digital logic) ends up overfit into a few nodes of a multi-billion-node network is hardly a surprise to anyone except the most religious of AI believers.

BobbyJo 10 days ago | parent | next [-]

The core issue there isn't that the LLM isn't building internal models to represent its world; it's that its world is limited to tokens. Anything not represented in tokens, or token relationships, can't be modeled by the LLM, by definition.

It's like asking a blind person to count the number of colors on a car. They can give it a go and assume glass, tires, and metal are different colors as there is likely a correlation they can draw from feeling them or discussing them. That's the best they can do though as they can't actually perceive color.

In this case, the LLM can't see letters, so asking it to count them causes it to try to draw on some proxy of that information. If it doesn't have an accurate one, then bam, strawberry has two r's.
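
To make that concrete, here is a small sketch of what "blueberry" looks like by the time a model sees it, using OpenAI's tiktoken library (assuming it is installed; the exact split depends on the vocabulary):

    # pip install tiktoken -- the model receives subword token IDs, not letters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # vocabulary used by several OpenAI models
    ids = enc.encode("blueberry")
    pieces = [enc.decode([i]) for i in ids]

    print(ids)     # a short list of integers
    print(pieces)  # e.g. ['blue', 'berry'] -- no standalone letter 'b' anywhere to count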

I think a good example of LLMs building models internally is this: https://rohinmanvi.github.io/GeoLLM/

LLMs are able to encode geospatial relationships because they can be represented well by token relationships. Two countries that are close together will be talked about together much more often than two countries far from each other.

vrighter 10 days ago | parent | next [-]

That is just not a solid argument. There are countless examples of LLMs splitting "blueberry" into "b l u e b e r r y", which would contain one token per letter. And then they still manage to get it wrong.

Your argument is based on a flawed assumption: that they can't see letters. If they couldn't, they wouldn't be able to spell the word out. But they can. And when they do get one token per letter, they still miscount.
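
That part is easy to check with the same kind of tokenizer sketch as above (again assuming tiktoken; whether each spaced-out letter really gets its own token depends on the vocabulary):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    spaced = "b l u e b e r r y"
    pieces = [enc.decode([i]) for i in enc.encode(spaced)]

    print(pieces)                                 # typically one short piece per letter
    print(sum(p.strip() == "b" for p in pieces))  # counting over those pieces is trivial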

xigoi 8 days ago | parent | prev [-]

> It's like asking a blind person to count the number of colors on a car.

I presume if I asked a blind person to count the colors on a car, they would reply “sorry, I am blind, so I can’t answer this question”.

eru 8 days ago | parent | prev | next [-]

> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Train your model on characters instead of on tokens, and this problem goes away. But I don't think this teaches us anything about world models more generally.
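
A minimal sketch of the point (not any particular model's tokenizer): with a character-level vocabulary, every letter of the input is its own token, so the information needed for counting sits right there in the sequence.

    word = "blueberry"
    char_tokens = list(word)        # ['b', 'l', 'u', 'e', 'b', 'e', 'r', 'r', 'y']
    print(char_tokens.count("b"))   # 2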

yosefk 10 days ago | parent | prev | next [-]

Actually I forgive them those issues that stem from tokenization. I used to make fun of them for listing datum as a noun whose plural form ends with an i, but once I learned about how tokenization works, I no longer do it - it feels like mocking a person's intelligence because of a speech impediment or something... I am very kind to these things, I think

astrange 8 days ago | parent [-]

Tokenization makes things harder, but it doesn't make them impossible. Just takes a bit more memorization.

Other writing systems come with "tokenization" built in, making this still a live issue. Think of answering:

1. How many n's are in 日本?

2. How many ん's are in 日本?

(Answers are 2 and 1.)
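
A quick way to check both answers, assuming the standard reading にほん / "nihon" for 日本:

    romaji = "nihon"    # romanized reading of 日本
    kana = "にほん"      # the same word written in hiragana

    print(romaji.count("n"))   # 2
    print(kana.count("ん"))    # 1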

andyjohnson0 10 days ago | parent | prev | next [-]

> With LLMs being unable to count how many Bs are in blueberry, they clearly don't have any world model whatsoever.

Is this a real defect, or some historical thing?

I just asked GPT-5:

    How many "B"s in "blueberry"?
and it replied:

    There are 2 — the letter b appears twice in "blueberry".
I also asked it how many Rs in Carrot, and how many Ps in Pineapple, and it answered both questions correctly too.
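
For anyone who wants to rerun the same check outside the chat UI, something like this works with the OpenAI Python client (the model name is taken from the comment above, the replies will vary, and an API key is assumed to be in the environment):

    # pip install openai -- re-asking the letter-counting questions via the API.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    questions = [
        'How many "B"s in "blueberry"?',
        "How many Rs in Carrot?",
        "How many Ps in Pineapple?",
    ]

    for q in questions:
        resp = client.chat.completions.create(
            model="gpt-5",  # assumed model name, as in the comment above
            messages=[{"role": "user", "content": q}],
        )
        print(q, "->", resp.choices[0].message.content)
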
libraryofbabel 10 days ago | parent | next [-]

It’s a historical thing that people still falsely claim is true, bizarrely without trying it on the latest models. As you found, leading LLMs don’t have a problem with it anymore.

jijijijij 8 days ago | parent | next [-]

The question is, did these LLMs figure it out by themselves, or has someone programmed a specific coroutine to address this "issue" to make it look smarter than it is?

On a trillion dollar budget, you could just crawl the web for AI tests people came up with and solve them manually. We know it's a massively curated game. With that kind of money you can do a lot of things. You could feed every human on earth countless blueberries, for starters.

Calling an algorithm to count letters in a word isn’t exactly worth the hype tho is it?

The point is, we keep finding new ways in which these LLMs can't figure out the most basic shit about the world. Horses can count. Counting is in everything. If you read every text ever written and still can't grasp counting, you simply are not that smart.

pydry 10 days ago | parent | prev | next [-]

Depends how you define historical. If by historical you mean more than two days ago then, yeah, it's ancient history.

pxc 8 days ago | parent | prev [-]

Some LLMs do better than others, but this still sometimes trips up even "frontier" non-reasoning models. People were showing this on this very forum with GPT-5 in the past couple days.

bgwalter 10 days ago | parent | prev | next [-]

It is not historical:

https://kieranhealy.org/blog/archives/2025/08/07/blueberry-h...

Perhaps they have a hot fix that special cases HN complaints?

AyyEye 10 days ago | parent [-]

They clearly RLHF out the embarrassing cases and make cheating on benchmarks into a sport.

Terr_ 8 days ago | parent [-]

I wouldn't be surprised if some models get set up to identify that type of question and run the word through a string-processing function.
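
Nobody outside the labs knows whether that happens, but such a guard would be trivial to bolt on. A purely speculative sketch (the pattern and routing are hypothetical, not anything a vendor has documented):

    import re

    # Hypothetical: spot "how many X in Y" questions and answer them with
    # ordinary string code instead of handing them to the model.
    PATTERN = re.compile(r"how many\s+\W*(\w)\W*s?\s+(?:are\s+)?in\s+\W*(\w+)", re.IGNORECASE)

    def maybe_count_letters(prompt):
        m = PATTERN.search(prompt)
        if not m:
            return None  # fall through to the model
        letter, word = m.group(1), m.group(2)
        return word.lower().count(letter.lower())

    print(maybe_count_letters('How many "B"s in "blueberry"?'))   # 2
    print(maybe_count_letters("What is the capital of France?"))  # None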

jijijijij 8 days ago | parent [-]

Of course they do stuff like that; otherwise it would look like they are stagnating. Fake it till you make it. Tho, at this point, the world is in deep shit if they don't make it…

pmg101 7 days ago | parent [-]

What deep shit do you foresee?

My prediction is that this will be like the 2000 dot-com bubble. Both dot-com and AI are real and really useful technologies, but hype and share prices have got way ahead of them, so they will need to readjust.

jijijijij 7 days ago | parent [-]

A major economic crisis, yes. I think the web is already kinda broken because of AI, and it's gonna get a lot worse. I also question its usefulness… Is it useful for solving any real problems, and if so, how long before we run out of those problems? Because we conflated a lot of bullshit with innovation right before AI.

Right now people may be getting a slight edge, but it's like getting a dishwasher: once expectations adjust, things will feel like a grind again, and I really don't think people will like that new reality in regard to their experience of self-efficacy (which is important for mental health). I presume the struggle to get information, figuring it out yourself, may be a really important part of putting pressure towards process optimization and of learning and cognitive development. We may collectively regress there.

With so many major crises, and a potential economic crisis on top, I am not sure we can afford to lose problem-solving capabilities to any extent. And I really, really don't think AI is worth the fantastical energy expenditure, waste of resources and human exploitation, so far.

ThrowawayR2 10 days ago | parent | prev | next [-]

It was discussed and reproduced on GPT-5 on HN couple of days ago: https://news.ycombinator.com/item?id=44832908

Sibling poster is probably mistakenly thinking of the strawberry issue from 2024 on older LLM models.

nosioptar 10 days ago | parent | prev [-]

Shouldn't the correct answer be that there is not a "B" in "blueberry"?

eru 8 days ago | parent [-]

No, why?

It depends on context. English is often not very precise and relies on implied context clues. And that's good. It makes communication more efficient in general.

To spell it out: in this case I suspect you are talking about English letter case? Most people don't care about case when they ask these questions, especially when asking informally.
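
In strictly case-sensitive terms the pedantic reading is defensible, which a one-liner shows:

    word = "blueberry"
    print(word.count("B"))           # 0 -- no uppercase B, the pedantic reading
    print(word.lower().count("b"))   # 2 -- what the question almost certainly means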

Nevermark 7 days ago | parent | prev | next [-]

That was always a specious test.

LLMs don't ingest text a character at a time. The difficulty with analyzing individual letters just reflected that they don't directly "see" letters in their tokenized input.

A direct comparison would be asking someone how many convex Bézier curves are in the spoken word "monopoly".

Or how many red pixels are in a visible icon.

We could work out answers to both. But they won't come to us one-shot or accurately, without specific practice.

libraryofbabel 10 days ago | parent | prev | next [-]

> they clearly don't have any world model whatsoever

Then how did an LLM get gold on the mathematical Olympiad, where it certainly hadn’t seen the questions before? How on earth is that possible without a decent working model of mathematics? Sure, LLMs might make weird errors sometimes (nobody is denying that), but clearly the story is rather more complicated than you suggest.

simiones 8 days ago | parent [-]

> where it certainly hadn’t seen the questions before?

What are you basing this certainty on?

And even if you're right that the specific questions had not come up, it may still be that the questions from the math olympiad were rehashes of similar questions in other texts, or happened to correspond well to a composition of some other problems that were part of the training set, such that the LLM could 'pick up' on the similarity.

It's also possible that the LLM was specifically trained on similar problems, or may even have a dedicated sub-net or tool for it. Still impressive, but possibly not in a way that generalizes even to math like one might think based on the press releases.

eru 8 days ago | parent | next [-]

> What are you basing this certainty on?

People make up new questions for each IMO.

fxtentacle 7 days ago | parent [-]

Didn’t OpenAI get caught bribing their way to pre-tournament access of the questions?

eru 7 days ago | parent [-]

This is the first time I hear about this. (It's certainly possible, but I'd need to see some evidence or at least a write-up.)

OpenAI got flamed over announcing their results before the embargo was up:

The IMO had asked companies to wait at least a week or so after the human winners were announced before announcing their AI results. OpenAI did not wait.

libraryofbabel 8 days ago | parent | prev [-]

Like the other reply said, each exam has entirely new questions which are of course secret until the test is taken.

Sure, the questions were probably in a similar genre as existing questions or required similar techniques that could be found in solutions that are out there. So what? You still need some kind of world model of mathematics in which to understand the new problem and apply the different techniques to solve it.

Are you really claiming that SOTA LLMs don’t have any world model of mathematics at all? If so, can you tell us what sort of example would convince you otherwise? (Note that the ability to do novel mathematics research is setting the bar too high, because many capable mathematics majors never get to that point, and they clearly have a reasonable model of mathematics in their heads.)

williamcotton 8 days ago | parent | prev | next [-]

I don’t solve math problems with my poetry writing skills:

https://chatgpt.com/share/689ba837-8ae0-8013-96d2-7484088f27...

derdi 7 days ago | parent | prev [-]

Ask a kid that doesn't know how to read and write how many Bs there are in blueberry.

Ygg2 7 days ago | parent [-]

For a kid that doesn't know how to read or write, ChatGPT writes way too much.