BugsJustFindMe 3 days ago

People are going to misinterpret this and overgeneralize the claim. The paper does not say that AI is unreliable in general; it provides a method for quantifying reliability on specific tasks.

You wouldn't say that a human who can't read is unreliable at everything, just at reading.

Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2-year-old is able to count even to 10, it's through memorization, not understanding. It takes them roughly 2 more years of learning before they can comprehend things like numerical correspondence. But they still know how to do plenty of things that aren't counting before then.

coldtea 3 days ago | parent | next [-]

>Counting is something that even humans need to learn how to do

No human who can program, solve advanced math problems, or discuss advanced problem domains at an expert level would, however, fail to count to 5.

This is not a mere "LLMs, like humans, also need to be taught this" but points to a fundamental mismatch in how humans and LLMs learn.

(And even if they merely needed to be taught, why would their huge corpus fail to cover that "teaching" while covering far more advanced topics in math and other domains?)

Topfi 3 days ago | parent | prev | next [-]

Respectfully, toddlers cannot output usable code or have otherwise memorised the results of an immense number of maths equations.

What this points at is the abstraction/emergence crux of it all. Why does an otherwise very capable LLM such as the GPT-5 series, despite having been trained on vastly more examples of frontend code of all shapes, sizes and quality levels, struggle to abstract that training data to the point where it can output any frontend that deviates from the examples it was clearly trained on?

If LLMs, as they are now, learned in a way comparable to humans, there'd be no scenario where a model that can output solutions to highly advanced equations cannot count properly.

Similarly, a model such as GPT-5, trained on nearly all frontend code ever committed to any repo online, would have internalised more than the one template OpenAI predominantly leaned on.

These models are, I think at this point there is little doubt, impressive tools, but they still do not generalise or abstract information the way a human mind does. That doesn't make them less impactful for industries, etc., but it makes any comparison to humans unsuitable.

BugsJustFindMe 3 days ago | parent [-]

> What this points at is the abstraction/emergence crux of it all. Why does

This paper has nothing to do with any questions starting with "why". It provides a metric for quantifying error on specific tasks.
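To make the "metric" point concrete, here's a minimal sketch (my own illustration, not the paper's actual method) of per-task error quantification: an empirical error rate with a confidence interval, computed separately for each task.

    import math

    def task_error_rate(outputs, expected):
        # Empirical error rate on one task, with a 95% normal-approximation CI.
        # `outputs` and `expected` are hypothetical lists of model answers
        # and ground-truth answers for that single task.
        n = len(outputs)
        p = sum(o != e for o, e in zip(outputs, expected)) / n
        half = 1.96 * math.sqrt(p * (1 - p) / n)
        return p, (max(0.0, p - half), min(1.0, p + half))

    # e.g. score a counting task independently of a coding task
    print(task_error_rate(["4", "5", "5"], ["5", "5", "5"]))

The point is that reliability gets estimated per task, not as one global "is the model reliable" number.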

> If LLMs, as they are now, were comparable with human learning

I think I missed the part where they need to be.

> struggle to abstract that training data to the point where it can output any frontend that deviates from the examples it was clearly trained on? ... a model such as GPT-5, trained on nearly all frontend code ever committed to any repo online, would have internalised more than the one template OpenAI predominantly leaned on

There is a very big and very important difference between producing the same thing again and not being able to produce something else. When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.

Long before AI there was this thing called Twitter Bootstrap. It dominated the web for... much longer than it should have. And that tragedy was committed entirely by us meatsacks (not me personally). Where there's no goal for different output, there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).

[I've edited this comment for content and format]

Topfi 3 days ago | parent | next [-]

> [...] common trope that was proven false years ago by the existence of zero-shot learning.

OK, that's better than comparing LLMs to humans. ZSL, however, has not proven anything of that sort false; it was mainly concerned with assessing whether LLMs rely solely on precise instruction training or can generalise to a very limited degree beyond the initial tuning. It has never supported comparing human learning to LLM training.

Ironically, you are writing this under a paper that shows just that:

A model that cannot determine a short string's parity cannot have abstracted from the training data to arrive at the far more impressive and complicated maths challenges it successfully solves in its output. Some of the solutions we have seen require such deep understanding that, if there is no generalisation far deeper than ZSL has ever shown, then this must come from training. Simple multiplication, maybe, but not the tasks people such as Easy Riders [0] throw at these models.
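To be clear about what "parity" means here (the exact task isn't quoted, so this is my assumption): the classic formulation is the even/odd count of a symbol in a short string, which is mechanically trivial:

    def parity(s: str, ch: str = "1") -> str:
        # Even/odd count of `ch` in `s` -- a mechanically trivial check
        # that these models reportedly still get wrong on short strings.
        return "even" if s.count(ch) % 2 == 0 else "odd"

    print(parity("1011001"))  # four 1s -> "even"

Anything that had genuinely abstracted counting would get this right at arbitrary lengths.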

This paper shows exactly that: even with ZSL, these models abstract only in an incredibly limited manner, and a lot of the capabilities we see in the output are specifically trained, not generalised. Yes, generalisation in a limited capacity can happen, but no, it is not nearly enough to yield some of the results we are seeing. Neither here nor in my initial comment have I said that LLMs are only capable of outputting what their training data provides; merely that, given what GPT-5 has been trained on, if these models had gained any deeper abstraction during training, it would be able to provide more than one frontend style.

Or to put it more simply: if the output can be useful for maths at Bachelor's level and beyond, and this capability is generalised as you believe, these simple tasks would not be a struggle for the model.

[0] https://www.youtube.com/@easy_riders

Topfi 3 days ago | parent | prev [-]

Just saw the edit.

> When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.

Ignoring the comparison with humans: yes, of course LLMs don't output anything unless specifically prompted. My point with GPT-5 was that, no matter how you prompt, you cannot get salvageable frontend code from this line of models.

OpenAI themselves tried and failed appallingly [0]. Call it "constraints", call it "reason", call it "prompting": you cannot get frontend code that deviates significantly from their card-laden training data. Despite GPT-5 having been trained with more high-quality frontend code examples than any human could ever read in a lifetime, that one template is overrepresented, because the model never generalised anything akin to an understanding of UI principles or of what code yields a specific design.

These are solvable problems, mind you, but not by a model at some stage gaining anything one could call an abstract understanding of these concepts; rather, by providing better training data or being clever in how you provide existing training data.

Gemini 3 and Claude 4 class models have a more varied training set, specifically of frontend templates, yielding better results. Though if you do any extended testing, you will see these repeat constantly because, again, these models never abstract beyond that template collection [1].

Moonshot, meanwhile, made a major leap with K2.5 by tying their frontend code tightly to visual input, leveraging the added vision encoder [2]. They are likely not the only ones doing that, but the first to state it clearly, going by the system cards. Even there, the gains are limited to a selection of very specific templates.

In either case, it is more specific data, not abstraction by these models, that yields the improvements.

> Twitter Bootstrap [...] entirely by us meatsacks (not me personally). Where there's no goal for different output there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).

What? So because some devs relied on Bootstrap, that means what exactly? That no one asked/told them to leverage a different solution, be more creative, what?

Again ignoring the comparison to humans, which just isn't appropriate for this tech: we can and do prompt models for specific frontend output. We are, if you must, providing the goal. The model, however, cannot accomplish said goal; even OpenAI cannot get GPT-5's lineage to deviate from their one template.

If we must stick with the human comparison, and further limit it to Bootstrap: GPT-5, despite being specifically prompted never to use the Bootstrap Carousel, cannot output any website without including one, because the template it was trained on included it. Any human developer asked to do so would simply not include a Carousel, because their abilities are abstracted beyond the one Bootstrap template they first learned with. To make the comparison truly fair, it would have to be a human who was trained on thousands of Bootstrap example pages, knew just one template really well, and never connected anything between that one and the others. Which isn't very human, but then again, that's why this comparison is not really a solid one.

[0] Subjectively not one good result; objectively, even their team of experts could not get their own model to shed the telltale signs of GPT frontend slop that originated from a template they have been training with since Horizon: https://developers.openai.com/blog/designing-delightful-fron...

[1] https://ui-design-bench.vercel.app

[2] https://www.kimi.com/blog/kimi-k2-5

nkrisc 3 days ago | parent | prev | next [-]

You’re conflating counting and language.

Many animals can count. Counting is recognizing that the box with 3 apples is preferable to the one with 2 apples.

Yes, 2-year-olds might struggle with the externalization of numeric identities, but if you have 1 M&M in one hand and 5 in the other and ask which they want, they'll take the 5.

LLMs have the language part down, but fundamentally can’t count.

BugsJustFindMe 3 days ago | parent [-]

The concept of bigger/smaller is useful, but it's a distinct skill from counting. If you spread the M&Ms apart enough that the part of the brain responsible for gestalt clustering can't group them into a "bigger whole" signal, they'll no longer be able to do the thing you're saying (this is the law of proximity in gestalt psychology).

adrian_b 3 days ago | parent [-]

Most animals can distinguish bigger from smaller.

However, many animals can independently distinguish small numbers, like 3 or 5, and recognize them whenever they see them.

So in this respect, there is little difference between humans and many animals. Humans learn to count to arbitrarily big numbers, but they can still easily recognize only small numbers.

BugsJustFindMe 3 days ago | parent [-]

> many animals can distinguish independently small numbers, like 3 or 5

This is called subitizing. It's distinct from counting. We can see the difference in humans with simultanagnosia, who are unable to count beyond the subitizing range. Subitizing is categorizing the scale of a small gestalt group.

The only thing I've ever seen where an animal appeared to demonstrate counting (up to 3) without training was in rhesus monkeys (maybe also chimpanzees?), but even that experiment could be explained through temporal gestalt. (It's the only explanation I know of for why they couldn't go higher than 3 in that experiment, given the many other things they can do.)

somethingweird 3 days ago | parent [-]

Even parrots can count to 6 and more; I would be surprised if primates couldn't.

BugsJustFindMe 3 days ago | parent [-]

At least one parrot has arguably been shown able to do that, after 30 years of focused training, but none have been shown able to without training. Wild parrots have only demonstrated subitizing and size discrimination, not counting.

The overeager do quite often confuse subitizing and size discrimination for counting, though. That's its own problem.

irishcoffee 3 days ago | parent | prev [-]

> Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If they're able to count even to 10, it's through memorization and not understanding.

I completely agree with you. LLMs are regurgitation machines with less intellect than a toddler; you nailed it.

AI is here!