grey-area 3 days ago

To those saying this is not surprising: yes, it will be surprising to the general public, who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, help them close deals by crunching numbers in seconds, write complex code for them, etc.

This is important information for anyone who thinks these systems are thinking, reasoning, and learning, or that they're having a real conversation with them, i.e. roughly 90% of LLM users.

stratos123 3 days ago | parent | next [-]

> saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them etc etc.

Why do you think the results of this paper contradict these claims at all?

grey-area 3 days ago | parent [-]

A machine that confabulates and cannot count is not a good fit for accounting tasks. It will make all sorts of subtle errors that are difficult for humans to notice.

stratos123 2 days ago | parent [-]

That wouldn't necessarily be true even if models really "couldn't count", since software exists: if an LLM builds an Excel spreadsheet instead of doing the arithmetic manually, it's both much harder for it to make a mistake and easier to notice and recover from one. And it's even less true given that what this paper actually tests is that LLMs don't have literally perfect accuracy when you make them do increasingly large problems with zero thinking.
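To make the "software exists" point concrete, here is a minimal sketch (with made-up ledger amounts) of what delegating arithmetic to code looks like: a model that emits something like this never has to count in its own weights, because the deterministic runtime does the arithmetic.

```python
from decimal import Decimal

# Hypothetical ledger amounts for illustration; the model's job is
# only to produce the code, not to add the numbers token by token.
amounts = ["19.99", "250.00", "3.50", "1042.17"]

# Decimal avoids binary floating-point rounding, which matters for money.
total = sum(Decimal(a) for a in amounts)
print(total)  # 1315.66
```

An error here (a mistyped amount) is visible in the source and reproducible, which is the "easier to notice and recover" part of the argument.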

(Confabulation is IMO a much bigger problem, but it's unrelated to architecture - it's an artifact of how models are currently trained.)

stronglikedan 3 days ago | parent | prev | next [-]

> general public

and the C-suite

orbital-decay 3 days ago | parent | prev [-]

Quick sanity check: you're susceptible to pretty irresistible optical illusions that would never fool a VLM. Does that mean you're not thinking? In fact, with a non-monospaced font I also have trouble determining whether parens are balanced, and I have to select them with the mouse, i.e. use a "dumb" tool, to make sure.
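The "dumb tool" in question can be a few lines of code. A sketch of the kind of balance check being described (assuming we only care about round parens):

```python
def parens_balanced(text: str) -> bool:
    """Scan left to right, tracking open-paren depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no matching '(' before it
                return False
    return depth == 0

print(parens_balanced("(a (b) c)"))  # True
print(parens_balanced("(a (b c)"))   # False
```

The tool has no "understanding" at all, which is the point: low-level reliability and thinking are separate properties.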

Reminder that "thinking" is an ill-defined term like others, and the question whether they "think" is basically irrelevant. No intelligent system, human or machine, will ever have zero error rate, due to the very nature of intelligence (another vague term). You have to deal with that the same way you deal with it in humans - either treat bugs as bugs and build systems resilient to bugs, or accept the baseline error rate if it's low enough.

flextheruler 2 days ago | parent [-]

Who is hiring anyone to look at a screen and count characters? Don't be disingenuous in your argument. The apt comparison would be the current technique used to accomplish this task, i.e. a pattern-matching algorithm.