| ▲ | griomnib 4 days ago |
| I think at this point it’s very clear LLMs aren’t achieving any form of “reasoning” as commonly understood. Among other factors, it can be argued that true reasoning involves symbolic logic and abstractions, and LLMs are next token predictors. |
|
| ▲ | xg15 4 days ago | parent | next [-] |
| I don't want to say that LLMs can reason, but this kind of argument always feels too shallow to me. It's kind of like saying that bats cannot possibly fly because they have no feathers or that birds cannot have higher cognitive functions because they have no neocortex. (The latter having been an actual longstanding belief in science which was disproven only a decade or so ago). The "next token prediction" is just the API; it doesn't tell you anything about the complexity of the thing that actually does the prediction. (I think there is some temptation to view LLMs as glorified Markov chains - they aren't. They are just "implementing the same API" as Markov chains). There is still a limit to how much an LLM could reason during prediction of a single token, as there is no recurrence between layers, so information can only be passed "forward". But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration. I think this structure makes it quite hard to really say how much reasoning is possible. |
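To make the loop-as-recurrence point concrete, here is a minimal sketch in Python. The next_token() function is a toy stand-in for a single forward pass (not any real model or API), and the only state carried between calls is whatever has already been emitted into the context:

    def next_token(context: list[str]) -> str:
        # Toy stand-in for one forward pass: a pure function of the context,
        # with no hidden state carried between calls.
        last = context[-1]
        if last.isdigit():
            n = int(last)
            return str(n + 1) if n < 5 else "<eos>"
        return "0"

    def generate(prompt: list[str], max_steps: int = 64) -> list[str]:
        context = list(prompt)
        for _ in range(max_steps):
            tok = next_token(context)   # stateless call...
            if tok == "<eos>":
                break
            context.append(tok)         # ...but the emitted token becomes state
        return context

    print(generate(["count:"]))  # ['count:', '0', '1', '2', '3', '4', '5']

Even though each call is a pure function of its input, the counter survives across iterations only because it was written into the output and fed back in, which is exactly the recurrence described above.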
| |
| ▲ | griomnib 4 days ago | parent | next [-] | | I agree with most of what you said, but “LLMs can reason” is an insanely huge claim to make, and most of the “evidence” so far is a mixture of corporate propaganda, “vibes”, and the like. I’ve yet to see anything close to the level of evidence needed to support the claim. | | |
| ▲ | vidarh 4 days ago | parent | next [-] | | To say any specific LLM can reason is a somewhat significant claim. To say LLMs as a class are architecturally able to be trained to reason is - in the complete absence of evidence to suggest humans can compute functions outside the Turing computable - effectively only an argument that they can implement a minimal Turing machine, given that the context is used as IO. Given how small the rule tables of the smallest known Turing machines are, it'd take a really tiny model to be unable to implement them. Now, you can then argue that it doesn't "count" if it needs to be fed a huge program step by step via IO, but if it can do something that way, I'd need some really convincing evidence for why the static elements of those steps could not progressively be embedded into a model. | | |
| ▲ | wizzwizz4 3 days ago | parent [-] | | No such evidence exists: we can construct such a model manually. I'd need some quite convincing evidence that any given training process is approximately equivalent to that, though. | | |
| ▲ | vidarh 3 days ago | parent [-] | | That's fine. I've made no claim about any given training process. I've addressed the annoying repetitive dismissal via the "but they're next token predictors" argument. The point is that being next token predictors does not cap their theoretical limits, so it's a meaningless argument. | | |
| ▲ | wizzwizz4 3 days ago | parent [-] | | The architecture of the model does place limits on how much computation can be performed per token generated, though. Combined with the window size, that's a hard bound on computational complexity that's significantly lower than a Turing machine – unless you do something clever with the program that drives the model. | | |
| ▲ | vidarh 3 days ago | parent [-] | | Hence the requirement for using the context for IO. A Turing machine requires two memory "slots" (the position of the read head, and the current state) + IO and a loop. That doesn't require much cleverness at all. |
|
|
|
| |
| ▲ | int_19h 3 days ago | parent | prev | next [-] | | "LLMs can reason" is trivially provable - all you need to do is give one a novel task (e.g. a logical puzzle) that requires reasoning, and observe it solving that puzzle. | |
| ▲ | staticman2 3 days ago | parent [-] | | How do you intend to show your task is novel? | | |
| ▲ | int_19h 19 hours ago | parent [-] | | "Novel" here simply means that the exact sequence of moves that is the solution cannot possibly be in the training set (mutatis mutandis). You can easily write a program that generates these kinds of puzzles at random, and feed them to the model. |
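One way to make "novel" concrete in code: generate instances whose surface form and exact solution are essentially guaranteed to be absent from any training corpus, then check the model's answer against the known solution. A small illustrative sketch in Python (invented names and a transitive-ordering puzzle of my own, not code from any particular benchmark):

    import random
    import string

    def invented_name(rng: random.Random) -> str:
        # Random 6-letter string, so the exact puzzle text is effectively
        # guaranteed to be unseen during training.
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(6)).capitalize()

    def make_puzzle(rng: random.Random, n: int = 5) -> tuple[str, str]:
        order = [invented_name(rng) for _ in range(n)]  # order[0] is the tallest
        clues = [f"{order[i]} is taller than {order[i + 1]}." for i in range(n - 1)]
        rng.shuffle(clues)  # scramble, so the chain has to be reassembled
        question = " ".join(clues) + " Who is the tallest?"
        return question, order[0]

    rng = random.Random(42)
    question, expected = make_puzzle(rng)
    print(question)
    print("expected:", expected)

The generated question would then be sent to the model and its reply compared against the expected answer; solving it requires chaining the shuffled comparisons rather than recalling a memorized instance.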
|
| |
| ▲ | hackinthebochs 3 days ago | parent | prev | next [-] | | Then say "no one has demonstrated that LLMs can reason" instead of "LLMs can't reason, they're just token predictors". At least that would be intellectually honest. | | |
| ▲ | Xelynega 3 days ago | parent [-] | | By that logic isn't it "intellectually dishonest" to say "dowsing rods don't work" if the only evidence we have is examples of them not working? | | |
| ▲ | hackinthebochs 3 days ago | parent [-] | | Not really. We know enough about how the world works to know that dowsing rods have no plausible mechanism of action. We do not know enough about intelligence/reasoning or how brains work to know that LLMs definitely aren't doing anything resembling that. |
|
| |
| ▲ | Propelloni 4 days ago | parent | prev [-] | | It's largely dependent on what we think "reason" means, is it not? That's not a pro argument from me, in my world LLMs are stochastic parrots. |
| |
| ▲ | vidarh 4 days ago | parent | prev [-] | | > But this limit doesn't exist if you consider the generation of the entire text: Suddenly, you do have a recurrence, which is the prediction loop itself: The LLM can "store" information in a generated token and receive that information back as input in the next loop iteration. Now consider that you can trivially show that you can get an LLM to "execute" one step of a Turing machine where the context is used as an IO channel, and you will have shown it to be Turing complete. > I think this structure makes it quite hard to really say how much reasoning is possible. Given the above, I think any argument that they can't be made to reason is effectively an argument that humans can compute functions outside the Turing computable set, which we haven't the slightest shred of evidence to suggest. | | |
| ▲ | Xelynega 3 days ago | parent [-] | | It's kind of ridiculous to say that functions computable by Turing computers are the only ones that can exist (and that trained LLMs are Turing computers). What evidence do you have for either of these? I don't recall any proof that "functions computable by Turing machines" is equal to the set of functions that can exist, and I don't recall pretrained LLMs being proven to be Turing machines. | | |
| ▲ | vidarh 3 days ago | parent [-] | | We don't have hard evidence that no other functions exist that are computable, but we have no examples of any such functions, and no theory for how to even begin to formulate any. As it stands, Church, Turing, and Kleene have proven that the set of general recursive functions, the lambda calculus, and the Turing computable set are equivalent, and no attempt to categorize computable functions outside those sets has succeeded since. If you want your name in the history books, all you need to do is find a single function that humans can compute that is outside the Turing computable set. As for LLMs, you can trivially test that they can act like a Turing machine if you give them a loop and use the context to provide access to IO: Turn the temperature down, and formulate a prompt to ask one to follow the rules of the simplest known Turing machine. A reminder that the simplest known Turing machine is a 2-state, 3-symbol Turing machine. It's quite hard to find a system that can carry out any kind of complex function that can't act like a Turing machine if you allow it to loop and give it access to IO. |
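A sketch of the harness that experiment implies, with the tape, head and state carried entirely in the prompt. The complete() function is a hypothetical stand-in for whatever chat-completion client you actually use (run at temperature 0), and the rule table itself would be pasted into the prompt; nothing here is any specific vendor's API:

    RULES = ("You are simulating a Turing machine. Apply exactly one transition "
             "from the rule table below, then reply only with: "
             "STATE=<s> HEAD=<i> TAPE=<t>\n"
             "Rule table: ...")  # paste the transition rules of your chosen machine here

    def complete(prompt: str) -> str:
        # Hypothetical stand-in: call your LLM of choice here, temperature=0.
        raise NotImplementedError

    def parse(reply: str) -> tuple[str, int, str]:
        fields = dict(part.split("=", 1) for part in reply.split())
        return fields["STATE"], int(fields["HEAD"]), fields["TAPE"]

    def run(tape: str, steps: int) -> str:
        state, head = "A", 0
        for _ in range(steps):
            reply = complete(f"{RULES}\nSTATE={state} HEAD={head} TAPE={tape}")
            state, head, tape = parse(reply)  # the context is the only memory between steps
        return tape

Whether a given model actually follows the rules reliably is the empirical question; the harness only shows that a loop plus context-as-IO is all the extra machinery involved.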
|
|
|
|
| ▲ | brookst 4 days ago | parent | prev | next [-] |
> Among other factors, it can be argued that true reasoning involves symbolic logic and abstractions, and LLMs are next token predictors. I think this is circular? If an LLM is "merely" predicting the next tokens to put together a description of symbolic reasoning and abstractions... how is that different from really exercising those things? Can you give me an example of symbolic reasoning that I can't handwave away as just the likely next words given the starting place? I'm not saying that LLMs have those capabilities; I'm questioning whether there is any utility in distinguishing the "actual" capability from identical outputs. |
| |
| ▲ | vidarh 4 days ago | parent | next [-] | | It is. As it stands, throw a loop around an LLM and use the context as the tape, and an LLM can obviously be made Turing complete (you can get it to execute all the steps of a minimal Turing machine, so drop the temperature so it's deterministic, and you have a Turing complete system). To argue that they can't be made to reason is effectively to argue that there is some unknown aspect of the brain that allows us to compute functions not in the Turing computable set, which would be an astounding revelation if it could be proven. Until someone comes up with evidence for that, it is more reasonable to assume that it is a question of whether we have yet found a training mechanism that can lead to reasoning or not, not whether or not LLMs can learn to. | | |
| ▲ | vundercind 3 days ago | parent [-] | | It doesn’t follow that because a system is Turing complete the approach being used will eventually achieve reasoning. | | |
| ▲ | vidarh 3 days ago | parent [-] | | No, but that was also not the claim I made. The point is that, as the person I replied to pointed out, calling LLMs "next token predictors" is a meaningless dismissal: they can be both next token predictors and Turing complete, and unless reasoning requires functions outside the Turing computable set (we know of no way of constructing such functions, nor any evidence that they exist), being "next token predictors" says nothing about their capabilities. |
|
| |
| ▲ | griomnib 4 days ago | parent | prev | next [-] | | Mathematical reasoning is the most obvious area where it breaks down. This paper does an excellent job of proving this point with some elegant examples: https://arxiv.org/pdf/2410.05229 | | |
| ▲ | brookst 4 days ago | parent | next [-] | | Sure, but people fail at mathematical reasoning. That doesn't mean people are incapable of reasoning. I'm not saying LLMs are perfect reasoners, I'm questioning the value of asserting that they cannot reason with some kind of "it's just text that looks like reasoning" argument. | | |
| ▲ | NBJack 4 days ago | parent | next [-] | | The idea is the average person would, sure. A mathematically oriented person would fare far better. Throw all the math problems you want at an LLM for training; it will still fail if you step outside of the familiar. | | |
| ▲ | ben_w 4 days ago | parent [-] | | > it will still fail if you step outside of the familiar. To which I say: ᛋᛟ᛬ᛞᛟ᛬ᚻᚢᛗᚪᚾᛋ | | |
| ▲ | trashtester 4 days ago | parent [-] | | ᛒᚢᛏ ᚻᚢᛗᚪᚾ ᚻᚢᛒᚱᛁᛋ ᛈᚱᛖᚹᛖᚾᛏ ᚦᛖᛗ ᚠᚱᛟᛗ ᚱᛖᚪᛚᛁᛉᛁᚾᚷ ᚦᚻᚪᛏ | | |
| ▲ | ben_w 3 days ago | parent [-] | | ᛁᚾᛞᛖᛖᛞ᛬ᛁᛏ᛬ᛁᛋ᛬ᚻᚢᛒᚱᛁᛋ ᛁ᛬ᚻᚪᚹᛖ᛬ᛟᚠᛏᛖᚾ᛬ᛋᛖᛖᚾ᛬ᛁᚾ᛬ᛞᛁᛋᚲᚢᛋᛋᛁᛟᚾᛋ᛬ᛋᚢᚲ᛬ᚪᛋ᛬ᚦᛁᛋ᛬ᚲᛚᚪᛁᛗᛋ᛬ᚦᚪᛏ᛬ᚻᚢᛗᚪᚾ᛬ᛗᛁᚾᛞᛋ᛬ᚲᚪᚾ᛬ᛞᛟ᛬ᛁᛗᛈᛟᛋᛋᛁᛒᛚᛖ᛬ᚦᛁᛝᛋ᛬ᛋᚢᚲ᛬ᚪᛋ᛬ᚷᛖᚾᛖᚱᚪᛚᛚᚣ᛬ᛋᛟᛚᚹᛖ᛬ᚦᛖ᛬ᚻᚪᛚᛏᛁᛝ᛬ᛈᚱᛟᛒᛚᛖᛗ edit: Snap, you said the same in your other comment :) | | |
| ▲ | trashtester 3 days ago | parent [-] | | Switching back to Latin letters... It seems to me that the idea of the Universal Turing Machine is quite misleading for a lot of people, such as David Deutsch. My impression is that the amount of compute to solve most problems that can really only be solved by Turing Machines is always going to remain inaccessible (unless they're trivially small). But at the same time, the universe seems to obey a principle of locality (as long as we only consider the Quantum Wave Function, and don't postulate that it collapses). Also, the quantum fields are subject to some simple (relative to LLMs) geometric symmetries, such as invariance under the U(1)xSU(2)xSU(3) group. As it turns out, similar group symmetries can be found in all sorts of places in the real world. Also it seems to me that at some level, both ANNs and biological brains set up a similar system to this physical reality, which may explain why brains develop this way and why both kinds are so good at simulating at least some aspects of the physical world, such as translation, rotation, some types of deformation, gravity, sound, light etc. And when biological brains that initially developed to predict the physical world are then used to create language, that language is bound to use the same type of machinery. And this may be why LLMs do language so well with a similar architecture. | | |
| ▲ | vidarh 3 days ago | parent [-] | | There are no problems that can be solved only by Turing Machines, as any Turing complete system can simulate any other Turing complete system. The point of UTMs is not to ever use them, but that they're a shortcut to demonstrating Turing completeness because of their simplicity. Once you've proven Turing completeness, you've proven that your system can compute all Turing computable functions and simulate any other Turing complete system, and we don't know of any computable functions outside this set. | | |
| ▲ | trashtester 3 days ago | parent [-] | | When I wrote Turing Machine, I meant it as shorthand for Turing complete system. My point is that any such system is extremely limited due to how slow it becomes at scale (when running algorithms/programs that require full Turing completeness), due to its "single threaded" nature. Such algorithms simply are not very parallelizable. This means a Turing Complete system becomes nearly useless for things like AI. The same is the case inside a human brain, where signals can only travel at around the speed of sound. Tensor / neuron based systems sacrifice Turing Completeness to gain (many) orders of magnitude more compute speed. I know that GPUs CAN in principle emulate a Turing Complete system, but they're even slower at it than CPUs, so that's irrelevant. The same goes for human brains. People like Deutsch are so in love with the theoretical universality of Turing Completeness that they seem to ignore that a Turing Complete system might take longer to formulate a meaningful thought than the lifetime of a human. And possibly longer than the lifetime of the Sun, for complex ideas. The fact that so much can be done by systems that are NOT Turing Complete may seem strange. But I argue that since the laws of Physics are local (with laws described by tensors), it should not be such a surprise that computers that perform tensor computations are pretty good at simulating physical reality. | | |
| ▲ | vidarh 3 days ago | parent [-] | | > My point is that any such system is extremely limited due to how slow it becomes at scale (when running algorithms/programs that require full Turing completeness), due to its "single threaded" nature. Such algorithms simply are not very parallelizable. My point is that this isn't true. Every computer you've ever used is a Turing complete system, and there is no need for such a system to be single-threaded, as a multi-threaded system can simulate a single-threaded system and vice versa. > I know that GPUs CAN in principle emulate a Turing Complete system, but they're even slower at it than CPUs, so that's irrelevant. The same goes for human brains. Any system that can emulate any other Turing complete system is Turing complete, so they are Turing complete. You seem to confuse Turing completeness with a specific way of executing something. Turing completeness is about the theoretical limits on which set of functions a system can execute, not how it executes them. | | |
| ▲ | trashtester 2 days ago | parent [-] | | > My point is that this isn't true. Every computer you've ever used is a Turing complete system, and there is no need for such a system to be single-threaded, as a multi-threaded system can simulate a single-threaded system and vice versa. Not all algorithms can be distributed effectively across multiple threads. A computer can have 1000 CPU cores, and only be able to use a single one when running such algorithms. Some other algorithms may be distributed through branch prediction, by trying to run future processing steps ahead of time for each possible outcome of the current processing step. In fact, modern CPUs already do this a lot to speed up processing. But even branch prediction hits something like a logarithmic wall of diminishing returns. While you are right that multi-core CPUs (or whole data centers) can run such algorithms, that doesn't mean they can run them quickly, hence my claim: >> My point is that any such system is extremely limited due to how slow Algorithms that can only utilize a single core seem to be stuck at the GFLOPS scale, regardless of what hardware they run on. Even if only a small percentage (like 5%) of the code in a program is inherently limited to being single threaded (At best, you will achieve TFlops numbers), this imposes a fatal limitation on computational problems that require very large amounts of computing power. (For instance at the ExaFlop scale or higher. A quick worked example of this ceiling follows below.) THIS is the flaw of the Turing Completeness idea. Algorithms that REQUIRE the full feature set of Turing Completeness are in some cases extremely slow. So if you want to do calculations that require, say, 1 ExaFlop (about the raw compute of the human brain) to be fast enough for a given purpose, you need to make almost all compute steps fully parallelizable. Now that you've restricted your algorithm to no longer require all features of a Turing Complete system, you CAN still run it on Turing Complete CPUs. You're just not MAKING USE of their Turing Completeness. That's just very expensive. At this point, you may as well build dedicated hardware that does not have all the optimizations that CPUs have for single threaded computation, like GPUs or TPUs, and lower your compute cost by 10x, 100x or more (which could be the difference between $500 billion and $billion). At this point, you've left the Turing Completeness paradigm fully behind. Though the real shift happened when you removed those requirements from your algorithm, not when you shifted the hardware. One way to describe this is that from the space of all possible algorithms that can run on a Turing Complete system, you've selected a small sub-space of algorithms that can be parallelized. By doing this trade, you've severely limited what algorithms you can run, in exchange for the ability to get a speedup of 6 orders of magnitude, or more in many cases. And in order to keep this speedup, you also have to accept other hardware based limitations, such as staying within the amount of GPU memory available, etc. Sure, you COULD train GPT-5 or Grok-3 on a C-64 with infinite cassette tape storage, but it would take virtually forever for the training to finish. So that fact has no practical utility. I DO realize that the concept of the equivalence of all Turing Complete systems is very beautiful. But this CAN be a distraction, and lead to intuitions that seem to me to be completely wrong. Like Deutsch's idea that the ideas in a human brain are fundamentally linked to the brain's Turing Completeness.
While in reality, it takes years of practice for a child to learn how to be Turing Complete, and even then the child's brain will struggle to do a floating point calculation every 5 minutes. Meanwhile, joint systems of algorithms and the hardware they run on can do very impressive calculations when ditching some of the requirements of Turing Completeness. | | |
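As a quick numerical illustration of the ceiling mentioned above (Amdahl's law, using the 5% figure from the comment): if a fraction of the work is inherently serial, the speedup from N parallel workers is 1 / (serial + (1 - serial) / N), which saturates at 1 / serial no matter how much hardware you add. A short Python check:

    def speedup(serial: float, workers: int) -> float:
        # Amdahl's law: overall speedup with a fixed serial fraction of the work.
        return 1.0 / (serial + (1.0 - serial) / workers)

    for n in (10, 100, 10_000, 1_000_000):
        print(f"{n:>9} workers: {speedup(0.05, n):5.2f}x")
    # ->       10 workers:  6.90x
    #         100 workers: 16.81x
    #       10000 workers: 19.96x
    #     1000000 workers: 20.00x

With 5% inherently serial work, no amount of parallel hardware gets past roughly a 20x speedup, which is the sense in which a small serial fraction becomes fatal at exascale.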
| ▲ | vidarh 2 days ago | parent [-] | | > Not all algorithms can be distributed effectively across multiple threads. Sure, but talking about Turing completeness is not about efficiency, but about the computational ability of a system. > THIS is the flaw of the Turing Completeness idea. Algorithms that REQUIRE the full feature set of Turing Completeness are in some cases extremely slow. The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port. It's not about whether the algorithms require a Turing complete system, but that Turing completeness proves the equivalence of the upper limit of which set of functions the architecture can compute, and that pretty much any meaningful architecture you will come up with is still Turing complete. > At this point, you've left the Turing Completeness paradigm fully behind. Though the real shift happened when you removed those requirements from your algorithm, not when you shifted the hardware. If a system can take 3 bits of input and use it to look up 5 bits of output in a table of 30 bits of data, and it is driven by a driver that uses 1 bit of the input as the current/new state, 1 bit for the direction to move the tape, and 3 bits for the symbol to read/write, and that driver processes the left/right/read/write tape operations and loops back, you have a Turing complete system (Wolfram's 2-state 3-symbol Turing machine). So no, you have not left Turing completeness behind, as any function that can map 3 bits of input to 5 bits of output becomes a Turing complete system if you can put a loop and IO mechanism around it. Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial. | | |
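For concreteness, here is a minimal Python sketch of that driver: a small (state, symbol) lookup table, a head position, and a loop over the tape. The table shown is an intentionally trivial machine (it just appends a 1 after the input) rather than Wolfram's actual (2,3) rules; a universal machine is the same driver with a different handful of table rows.

    # (state, symbol) -> (new state, symbol to write, head movement)
    TABLE = {
        ("A", "0"): ("A", "0", +1),   # keep scanning right
        ("A", "1"): ("A", "1", +1),
        ("A", "_"): ("H", "1", -1),   # found the blank: write a 1 and halt
    }

    def run(tape: str, max_steps: int = 10_000) -> str:
        cells, state, head = list(tape), "A", 0
        for _ in range(max_steps):
            if state == "H":
                break
            if head == len(cells):            # grow the tape on demand
                cells.append("_")
            state, cells[head], move = TABLE[(state, cells[head])]
            head = max(0, head + move)
        return "".join(cells)

    print(run("101_"))  # -> 1011

Everything outside the table is just the loop and the tape IO, which is the sense in which the per-step "computation" that has to be supplied is so small.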
| ▲ | trashtester 2 days ago | parent [-] | | > Sure, but talking about Turing completeness is not about efficiency, but about the computational ability of a system. I know. That is part of my claim that this "talking about Turing completeness" is a distraction. Specifically because it ignores efficiency/speed. > Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial. And again, I KNOW that it's utterly trivial to create a Turing Complete system. I ALSO know that a Turing Complete system can perform ANY computation (it pretty much defines what a computation is), given enough time and memory/storage. But if such a calculation takes 10^6 times longer than necessary, it's also utterly useless to approach it in this way. Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step. > The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port. This model is intrinsically single threaded, so the global branches/dependencies requirement is trivially satisfied. Generally, though, if you want to be able to distribute a computation, you have to pretty much never allow the results of a computation of any arbitrary compute thread to affect the next computation on any of the other threads. NOBODY would be able to train LLMs that are anything like the ones we see today, if they were not willing to make that sacrifice. Also, downstream from this are the hardware optimizations that are needed to even run these algorithms. While you _could_ train any of the current LLMs on large CPU clusters, a direct port would require perhaps 1000x more hardware, electricity, etc than running it on GPUs or TPUs. Not only that, but if the networks being trained (+ some amount of training data) couldn't fit into the fast GPU/TPU memory during training, but instead had to be swapped in and out of system memory or disk, then that would also cause orders of magnitude of slowdown, even if using GPU/TPUs for the training. In other words, what we're seeing is a trend towards ever increasing coupling between algorithms being run and the hardware they run on. When I say that thinking in terms of Turing Completeness is a distraction, it doesn't mean it's wrong. It's just irrelevant. | | |
| ▲ | vidarh a day ago | parent [-] | | > NOBODY would be able to train LLMs that are anything like the ones we see today, if they were not willing to make that sacrifice. Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions, so they haven't made that sacrifice, is the point. Because Turing completeness does not mean all, or most, or even any of your computations need to be carried out like in a UTM. It only means it needs the theoretical ability. They can take any set of shortcuts you want. | | |
| ▲ | trashtester a day ago | parent | next [-] | | > Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions, so they haven't made that sacrifice, is the point. I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant. > It only means it needs the theoretical ability. This is absolutely incorporated in my previous post. Which is why I wrote: >> Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step. > It only means it needs the theoretical ability. They can take any set of shortcuts you want. I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts. Only by staying within the subset of all possible algorithms that CAN be properly parallelized (and have the proper hardware to run it) can you perform the number of calculations needed to train something like an LLM. > Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions, so they haven't made that sacrifice, Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM. On the other hand, if you make a cluster to train LLMs that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single CUDA run activated, for instance), it can still be just as good at its dedicated task (training LLMs). Another example would be the brain of a new-born baby. I'm pretty sure such a brain is NOT Turing Complete in any way. It has a very short list of training algorithms that are constantly running as it's developing. But it can't even run Hello World. For it to really be Turing Complete, it needs to be able to follow instructions accurately (no hallucinations, etc) and also needs access to infinite storage/tape (or it will be a Finite State Machine). Again, it still doesn't matter if it's Turing Complete in this context. | | |
| ▲ | vidarh 16 hours ago | parent [-] | | > I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant. Why do you think it is irrelevant? It is what allows us to say with near certainty that dismissing the potential of LLMs to be made to reason is unscientific and irrational. > I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts. But again, we've not sacrificed the ability to run those. > Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM. Maybe or maybe not, because today inference is expensive, but we already are running plenty of algorithms that require many runs, and that number is steadily increasing as inference speed relative to network size improves. > On the other hand, if you make a cluster to train LLMs that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single CUDA run activated, for instance), it can still be just as good at its dedicated task (training LLMs). And? The specific code used to run training has no relevance to the limitations of the model architecture. | | |
| ▲ | trashtester 15 hours ago | parent [-] | | See my other response above, I think I've identified what part of my argument was unclear. The update may still have claims in it that you disagree with, but those are specific and (at some point in the future) probably testable. |
|
| |
| ▲ | trashtester 15 hours ago | parent | prev [-] | | First I would like to thank you for being patient with me. After some contemplation, I think I've identified what aspect of my argument hasn't been properly justified, which causes this kind of discussion. Let's first define C as the set of all algorithms that are computable by any Turing complete system. The main attractive feature of Turing Completeness is specifically this universality. You can take an algorithm running on one Turing Complete system and port it to another, with some amount of translation work (often just a compiler). Now let's define the subset of all algorithms in C that we are not able to properly parallelize, and label it U. (U is a subset of C). The complementary subset of C, that CAN be parallelized properly, we label P (P is also a subset of C). Now define algorithms that require a lot of compute (>= 10^20 steps or so) as L. L is also a subset of C. The complementary ("small" computations) can be labelled S (< 10^20 steps, though the exact cutoff is a bit arbitrary). Now we define the intersections of S, L, U, P: (Edit: changed union to intersect) S_P (intersect of S and P) L_P (intersect of L and P) S_U (intersect of S and U) L_U (intersect of L and U) (This taxonomy is restated compactly in set notation right after this comment.) For S_P and S_U, the advantages of Turing Completeness remain. L_U is going to be hard to compute on any Turing Complete system. (Edit: The mistake in my earlier argument was to focus on L_U. L_U is irrelevant for the relevance of the universality of Turing Complete systems, since no such system can run such calculations in a reasonable amount of time, anyway. To run algorithms in the L_U domain would require either some fundamental breakthrough in "single threaded" performance, Quantum Computing or some kind of magic/soul, etc) This leaves us with L_P. These are computations/algorithms that CAN be parallelized, at least in principle. I will only discuss these from here on. My fundamental claim is as follows: While algorithms/computations that belong to the L_P set ARE in theory computable on any Turing Complete system, the time it takes to compute them can vary so much between different Turing Complete systems that this "universality" stops having practical relevance. For instance, let's say two computers K1 and K2 can run one such algorithm (lp_0, for instance) at the same speed. But on other algorithms (lp_1 and lp_2) the difference between how fast those systems can run the computation can vary by a large factor (for instance 10^6), often in both directions. Let's say lp_1 is 10^6 times faster on K1 than on K2, while lp_2 is 10^6 times faster on K2 than on K1. (Edit: note that lp_2 will take (10^6)^2 = 10^12 times longer than lp_1 on K1) While both these algorithms are in theory computable on both K1 and K2, this is now of very little practical importance. You always want to run lp_1 on K1 and lp_2 on K2. Note that I never say that either K1 or K2 are (Edit) not Turing complete. But the main attraction of Turing Completeness is now of no practical utility, namely the ability to move an algorithm from one Turing Complete system to another. Which also means that what you really care about is not Turing Completeness at all. You care about the ability to calculate lp_1 and lp_2 within a reasonable timeframe, days to years, not decades to beyond eons. And this is why I'm claiming that this is a paradigm shift. The Turing Completeness ideas were never wrong, they just stopped being useful in the kind of computations that are getting most of the attention now.
Instead, we're moving into a reality where computers are to an ever greater degree specialized for a single purpose, while the role of general purpose computers is fading. And THIS is where I think my criticism of Deutsch is still accurate. Or rather, if we assume that the human brain belongs to the L_P set and strongly depends on its hardware for doing what it is doing, this creates a strong constraint on the types of hardware that the human mind can conceivably be uploaded to. And vice versa. While Deutsch tends to claim that humans will be able to run any computation that ASI will run in the future, I would claim that to the extent a human is able to act as Turing Complete, such computations may take more than a lifetime and possibly longer than the time until the heat death of the universe. And I think where Deutsch goes wrong is that he thinks that our conscious thoughts are where our real cognition is going on. My intuition is that while our internal monologue operates at around 1-100 operations per second, our subconscious requires something in the range of gigaflops to exaflops for our minds to be able to operate in real time. | | |
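Restating the taxonomy above compactly (same definitions as in the comment, just in set notation):

    \begin{align*}
    C &= \{\text{algorithms computable on a Turing complete system}\},\\
    P &\subseteq C \ \text{(properly parallelizable)}, \qquad U = C \setminus P,\\
    L &\subseteq C \ \text{(requiring} \gtrsim 10^{20} \text{ steps)}, \qquad S = C \setminus L,\\
    \text{the four classes:}&\quad S \cap P,\quad L \cap P,\quad S \cap U,\quad L \cap U.
    \end{align*}

The comment's claim then concerns the L-and-P class: formally portable between Turing complete systems, but with runtimes that can differ by factors like 10^6 in either direction, so the portability stops mattering in practice.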
| ▲ | vidarh 13 hours ago | parent [-] | | So, the reason I think the argument on Turing completeness matters here is that if we accept that an LLM and a brain are both Turing complete, then while you're right that there can be Turing complete systems that are so different in performance characteristics that some are entirely impractical (Turing's original UTM is an example of a Turing complete system that is too slow for practical use), if they are both Turing complete the brain is then both an existence-proof for the ability of Turing machines to be made to reason and gives us an upper limit in terms of volume and power needed to achieve human-level reasoning. It may take us a long time to get there (it's possible we never will), and it may take significant architectural improvements (so it's not a given current LLM architectures can compete on performance), but if both are Turing complete (and not more), then given the human ability to reason, the possibility cannot be dismissed out of hand for LLMs. It means those who dismiss LLMs as "just" next token predictors, assuming that this says something about the possibility of reasoning, don't have a leg to stand on. And this is why the Turing completeness argument matters to me. I regularly get in heated arguments with people who get very upset at the notion that LLMs can possibly ever reason - this isn't a hypothetical. > And vice versa. While Deutsch tends to claim that humans will be able to run any computation that ASI will run in the future, I would claim that to the extent a human is able to act as Turing Complete, such computations may take more than a lifetime and possibly longer than the time until the heat death of the universe. If you mean "act as" in the sense of following operations of a Turing-style UTM with tape, then sure, that will be impractically slow for pretty much everything. Our ability to do so (and the total lack of evidence that we can do anything which exceeds Turing completeness) just serves as a simple way of proving we are Turing complete. In practice, we do most things in far faster ways than simulating a UTM. But I agree with you that I don't think humans can compete with computers in the long run irrespective of the type of problem. | | |
| ▲ | trashtester 12 hours ago | parent [-] | | Ok, so it looks like you think you've been arguing against someone who doubts that LLMs (and similar NNs) can match the capabilities of humans. In fact, I'm probably on the other side from you compared to them. Now let's first look at how LLMs operate in practice: Current LLMs will generally run on some compute cluster, often with some VM layer (and sometimes maybe bare metal), followed by an OS on each node, and then Torch/TensorFlow etc to synchronize them. It doesn't affect the general argument if we treat the whole inference system (the training system is similar) as one large Turing Complete system. Since LLMs have from billions to trillions of weights, I'm going to assume that for each token produced by the LLM it will perform 10^12 FP calculations. Now, let's assume we want to run the LLM itself as a Turing Machine. Kind of like a virtual machine INSIDE the compute cluster. A single floating point multiplication may require on the order of 10^3 tokens. In other words, by putting 10^15 floating point operations in, we can get 1 floating point operation out. Now this LLM COULD run any other LLM inside it (if we chose to ignore memory limitations). But it would take at minimum on the order of 10^15 times longer to run than the first LLM. My model of the brain is similar. We have a substrate (the physical brain) that runs a lot of computation; one tiny part of that is the ability that trained adults can get to perform any calculation (making us Turing Complete). But compared to the number of raw calculations required by the substrate, our capability to perform universal computation is maybe 1 : 10^15, like the LLM above. Now, I COULD be wrong in this. Maybe there is some way for LLMs to achieve full access to the underlying hardware for generic computation (if only the kinds of computations other computers can perform). But it doesn't seem that way to me, neither for current generation LLMs nor human brains. Also, I don't think it matters. Why would we build an LLM to do the calculations when it's much more efficient to build hardware specifically to perform such computations, without the hassle of running it inside an LLM? The exact computer that we run the LLM (above) on would be able to load other LLMs directly instead of using an intermediary LLM as a VM, right? It's still not clear to me why this is not obvious.... My speculation, though, is that there is an element of sunk cost fallacy involved. Specifically for people of my (and, I believe, your) age that had a lot of our ideas about these topics formed in the 90s and maybe 80s/70s. Go back 25+ years, and I would agree to almost everything you write. At the time computers mostly did single threaded processing, and naïve extrapolation might indicate that the computers of 2030-2040 would reach human level computation ability in a single thread. In such a paradigm, every computer of approximately comparable total power would be able to run the same algorithms. But that stopped being the case around 10 years ago, and the trend seems to continue to be in the direction of purpose-specific hardware taking over from general purpose machines.
Edit: To be specific, the sunk cost fallacy enters here because people have been having a lot of clever ideas that depend on the principle of Turing Completeness, like the ability to easily upload minds to computers, or to think of our mind as a barebone computer (not like an LLM, but more like KVM or a Blank Slate), where we can plug in any kind of culture, memes, etc. |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
| ▲ | dartos 4 days ago | parent | prev [-] | | People can communicate each step, and review each step as that communication is happening. LLMs must be prompted for everything and don’t act on their own. The value in the assertion is in preventing laymen from seeing a statistical guessing machine be correct and assuming that it always will be. It’s dangerous to put so much faith in what in reality is a very good guessing machine.
You can ask it to retrace its steps, but it’s just guessing at what its steps were, since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps. | | |
| ▲ | brookst 4 days ago | parent | next [-] | | > since it didn’t actually go through real reasoning, just generated text that reads like reasoning steps. Can you elaborate on the difference? Are you bringing sentience into it? It kind of sounds like it from "don't act on their own". But reasoning and sentience are wildly different things. > It’s dangerous to put so much faith in what in reality is a very good guessing machine Yes, exactly. That's why I think it is good we are supplementing fallible humans with fallible LLMs; we already have the processes in place to assume that not every actor is infallible. | | |
| ▲ | david-gpu 4 days ago | parent | next [-] | | So true. People who argue that we should not trust/use LLMs because they sometimes get it wrong are holding them to a higher standard than people -- we make mistakes too! Do we blindly trust or believe every single thing we hear from another person? Of course not. But hearing what they have to say can still be fruitful, and it is not like we have an oracle at our disposal who always speaks the absolute truth, either. We make do with what we have, and LLMs are another tool we can use. | |
| ▲ | vundercind 3 days ago | parent | prev [-] | | > Can you elaborate on the difference? They’ll fail in different ways than something that thinks (and doesn’t have some kind of major disease of the brain going on) and often smack in the middle of appearing to think. |
| |
| ▲ | ben_w 4 days ago | parent | prev | next [-] | | > People can communicate each step, and review each step as that communication is happening. Can, but don't by default. Just as LLMs can be asked for chain of thought, but the default for most users is just chat. This behaviour of humans is why we software developers have daily standup meetings, version control, and code review. > LLMs must be prompted for everything and don’t act on their own And this is why we humans have task boards like JIRA, and quarterly goals set by management. | |
| ▲ | vidarh 3 days ago | parent | prev | next [-] | | LLMs "don't act on their own" because we only reanimate them when we want something from them. Nothing stops you from wiring up an LLM to keep generating, and feeding it sensory inputs to keep it processing. In other words, that's a limitation of the harness we put them in, not of LLMs. As for people communicating each step, we have plenty of experiments showing that it's pretty hard to get people to reliably report what they actually do as opposed to a rationalization of what they've actually done (e.g. split brain experiments have shown both your brain halves will happily lie about having decided to do things they haven't done if you give them reason to think they've done something). You can categorically not trust people's reasoning about "why" they've made a decision to reflect what actually happened in their brain to make them do something. | |
| ▲ | int_19h 3 days ago | parent | prev [-] | | A human brain in a vat doesn't act on its own, either. |
|
| |
| ▲ | Workaccount2 4 days ago | parent | prev [-] | | Maybe I am not understanding the paper correctly, but it seems they tested "state of the art models", a set almost entirely composed of open source <27B parameter models. Mostly 8B and 3B models. This is kind of like giving algebra problems to 7 year olds to "test human algebra ability." If you are holding up a 3B parameter model as an example of "LLMs can't reason" I'm not sure if the authors are confused or out of touch. I mean, they do test 4o and o1-preview, but their performance is notably absent from the paper's conclusion. | |
| ▲ | dartos 4 days ago | parent [-] | | It’s difficult to reproducibly test OpenAI models, since they can change from under you and you don’t have control over every hyperparameter. It would’ve been nice to see one of the larger Llama models though. | |
| ▲ | og_kalu 4 days ago | parent [-] | | The results are there; they're just hidden away in the appendix. The result is that those models don't actually suffer drops on 4/5 of their modified benchmarks. The one benchmark that does see actual drops that aren't explained by margin of error is the benchmark that adds "seemingly relevant but ultimately irrelevant information to problems". Those results are absent from the conclusion because the conclusion falls apart otherwise. |
|
|
| |
| ▲ | dartos 4 days ago | parent | prev | next [-] | | There isn’t much utility, but tbf the outputs aren’t identical. One danger is the human assumption that, since something appears to have that capability in some settings, it will have that capability in all settings. That’s a recipe for exploding bias, as we’ve seen with classic statistical crime detection systems. | |
| ▲ | NBJack 4 days ago | parent | prev [-] | | Inferring patterns in unfamiliar problems. Take a common word problem in a 5th grade math textbook. Now, change as many words as possible; instead of two trains, make it two different animals; change the location to a rarely discussed town; etc. Even better, invent words/names to identify things. Someone who has done a word problem like that will very likely recognize the logic, even if the setting is completely different. Word tokenization alone should fail miserably. | |
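One way to operationalize that test: keep the underlying arithmetic fixed, randomize every surface detail (invented names, objects, numbers), and check whether the solver still recovers the same formula. An illustrative Python sketch (my own template, not code from the paper cited upthread):

    import random

    TEMPLATE = ("{name} the {creature} gathers {a} {item}s every {unit}. "
                "After {t} {unit}s, how many {item}s does {name} have?")

    def invent(rng: random.Random, k: int) -> str:
        # Pronounceable-ish invented word: alternate consonants and vowels.
        return "".join(rng.choice("bcdfghklmnprstvz" if i % 2 == 0 else "aeiou")
                       for i in range(k)).capitalize()

    def make_variant(rng: random.Random) -> tuple[str, int]:
        a, t = rng.randint(2, 12), rng.randint(3, 20)
        question = TEMPLATE.format(name=invent(rng, 5), creature=invent(rng, 7),
                                   item=invent(rng, 4),
                                   unit=rng.choice(["day", "week", "moon"]),
                                   a=a, t=t)
        return question, a * t   # the underlying logic is always rate times time

    rng = random.Random(0)
    question, answer = make_variant(rng)
    print(question)
    print("expected:", answer)

A solver that has picked up the rate-times-time structure answers every variant; one that pattern-matches on familiar surface wording should start to fail as the wording drifts further from anything seen before.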
| ▲ | djmips 4 days ago | parent | next [-] | | I have noted over my life that a lot of problems end up being a variation on solved problems from another more familiar domain but frustratingly take a long time to solve before realizing this was just like that thing you had already solved. Nevertheless, I do feel like humans do benefit from identifying meta patterns but as the chess example shows even we might be weak in unfamiliar areas. | | |
| ▲ | Propelloni 4 days ago | parent [-] | | Learn how to solve one problem and apply the approach, logic and patterns to different problems. In German that's called "Transferleistung" (roughly "transfer success") and a big thing at advanced schools. Or, at least my teacher friends never stop talking about it. We get better at it over time, as probably most of us can attest. |
| |
| ▲ | roywiggins 3 days ago | parent | prev [-] | | A lot of LLMs do weird things on the question "A farmer needs to get a bag of grain across a river. He has a boat that can transport himself and the grain. How does he do this?" (they often pattern-match on the farmer/grain/sheep/fox puzzle and start inventing pointless trips ("the farmer returns alone. Then, he crosses again.") in a way that a human wouldn't) |
|
|
|
| ▲ | Sharlin 4 days ago | parent | prev | next [-] |
| What proof do you have that human reasoning involves "symbolic logic and abstractions"? In daily life, that is, not in a math exam. We know that people are actually quite bad at reasoning [1][2]. And it definitely doesn't seem right to define "reasoning" as only the sort that involves formal logic. [1] https://en.wikipedia.org/wiki/List_of_fallacies [2] https://en.wikipedia.org/wiki/List_of_cognitive_biases |
| |
| ▲ | trashtester 4 days ago | parent [-] | | Some very intelligent people, including Gödel and Penrose, seem to think that humans have some kind of ability to arrive directly at correct propositions in ways that bypass the incompleteness theorem. Penrose seems to think this can be due to Quantum Mechanics; Gödel may have thought it came from something divine. While I think they're both wrong, a lot of people seem to think they can do abstract reasoning for symbols or symbol-like structures without having to use formal logic for every step. Personally, I think such beliefs about concepts like consciousness, free will, qualia and emotions emerge from how the human brain includes a simplified version of itself when setting up a world model. In fact, I think many such elements are pretty much hard coded (by our genes) into the machinery that human brains use to generate such world models. Indeed, if this is true, concepts like consciousness, free will, various qualia and emotions can in fact be considered "symbols" within this world model. While the full reality of what happens in the brain when we exercise what we represent by "free will" may be very complex, the world model may assign a boolean to each action we (and others) perform, where the action is either grouped into "voluntary action" or "involuntary action". This may not always be accurate, but it saves a lot of memory and compute costs for the brain when it tries to optimize for the future. This optimization can be (and usually is) called "reasoning", even if the symbols have only an approximate correspondence with physical reality. For instance, if in our world model somebody does something against us and we deem that it was done exercising "free will", we will be much more likely to punish them than if we categorize the action as "forced". And on top of these basic concepts within our world model, we tend to add a lot more, also in symbol form, to enable us to use symbolic reasoning to support our interactions with the world. | | |
| ▲ | TeMPOraL 3 days ago | parent | next [-] | | > While I think they're both wrong, a lot of people seem to think they can do abstract reasoning for symbols or symbol-like structures without having to use formal logic for every step. Huh. I don't know bout incompleteness theorem, but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything, they only painstakingly emulate it when forced to. If anything, "next token prediction" seems much closer to how human thinking works than anything even remotely formal or symbolic that was proposed before. As for hardcoding things in world models, one thing that LLMs do conclusively prove is that you can create a coherent system capable of encoding and working with the meaning of concepts without providing anything that looks like explicit "meaning". Meaning is not inherent to a term, or a concept expressed by that term - it exists in the relationships between the concept and all other concepts. | |
| ▲ | ben_w 3 days ago | parent | next [-] | | > I don't know bout incompleteness theorem, but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything, they only painstakingly emulate it when forced to. Indeed, this is one reason why I assert that Wittgenstein was wrong about the nature of human thought when writing: """If there were a verb meaning "to believe falsely," it would not have any significant first person, present indicative.""" Sure, it's logically incoherent for us to have such a word, but there's what seems like several different ways for us to hold contradictory and incoherent beliefs within our minds. | |
| ▲ | trashtester 3 days ago | parent | prev [-] | | ... but I'd say it's pretty obvious (both in introspection and in observation of others) that people don't naturally use formal logic for anything ... Yes. But some place too much confidence in how "rational" their intuition is, including some of the most intelligent minds the world has seen. Specifically, many operate as if their intuition (that they treat as completely rational) has some kind of supernatural/magic/divine origin, including many who (imo) SHOULD know better. Whereas I think (like you do) that this intuition has more in common with LLMs and other NN architectures than with pure logic, or even the scientific method. |
| |
| ▲ | raincole 3 days ago | parent | prev [-] | | > Some very intelligent people, including Gödel and Penrose, seem to think that humans have some kind of ability to arrive directly at correct propositions in ways that bypass the incompleteness theorem. Penrose seems to think this can be due to Quantum Mechanics; Gödel may have thought it came from something divine. Did Gödel really say this? It sounds like quite a stretch of the incompleteness theorem. It's like saying that because the halting problem is undecidable but humans can debug programs, human brains must have some supernatural power. | |
|
|
|
| ▲ | Uehreka 4 days ago | parent | prev | next [-] |
| Does anyone have a hard proof that language doesn’t somehow encode reasoning in a deeper way than we commonly think? I constantly hear people saying “they’re not intelligent, they’re just predicting the next token in a sequence”, and I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence”, but I’ve seen enough surprising studies about the nature of free will and such that I no longer put a lot of stock in what seems “obvious” to me about how my brain works. |
| |
| ▲ | spiffytech 4 days ago | parent [-] | | > I’ll grant that I don’t think of what’s going on in my head as “predicting the next token in a sequence” I can't speak to whether LLMs can think, but current evidence indicates humans can perform complex reasoning without the use of language: > Brain studies show that language is not essential for the cognitive processes that underlie thought. > For the question of how language relates to systems of thought, the most informative cases are cases of really severe impairments, so-called global aphasia, where individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ... > You can ask them to solve some math problems or to perform a social reasoning test, and all of the instructions, of course, have to be nonverbal because they can’t understand linguistic information anymore. ... > There are now dozens of studies that we’ve done looking at all sorts of nonlinguistic inputs and tasks, including many thinking tasks. We find time and again that the language regions are basically silent when people engage in these thinking activities. https://www.scientificamerican.com/article/you-dont-need-wor... | | |
| ▲ | SAI_Peregrinus 4 days ago | parent | next [-] | | I'd say that's a separate problem. It's not "is the use of language necessary for reasoning?" which seems to be obviously answered "no", but rather "is the use of language sufficient for reasoning?". | |
| ▲ | cortic 3 days ago | parent | prev [-] | | > ..individuals basically lose completely their ability to understand and produce language as a result of massive damage to the left hemisphere of the brain. ... The right hemisphere almost certainly uses internal 'language' either consciously or unconsciously to define objects, actions, intent.. the fact that they passed these tests is evidence of that. The brain damage is simply stopping them expressing that 'language'. But the existence of language was expressed in the completion of the task.. |
|
|
|
| ▲ | hathawsh 4 days ago | parent | prev | next [-] |
| I think the question we're grappling with is whether token prediction may be more tightly related to symbolic logic than we all expected. Today's LLMs are so uncannily good at faking logic that it's making me ponder logic itself. |
| |
| ▲ | griomnib 4 days ago | parent [-] | | I felt the same way about a year ago, I’ve since changed my mind based on personal experience and new research. | | |
| ▲ | hathawsh 4 days ago | parent [-] | | Please elaborate. | | |
| ▲ | dartos 4 days ago | parent [-] | | I work in the LLM search space and echo OC’s sentiment. The more I work with LLMs the more the magic falls away and I see that they are just very good at guessing text. It’s very apparent when I want to get them to do a very specific thing.
They get inconsistent about it. | | |
| ▲ | griomnib 4 days ago | parent | next [-] | | Pretty much the same, I work on some fairly specific document retrieval and labeling problems. After some initial excitement I’ve landed on using LLMs to help train smaller, more focused models for specific tasks. Translation is a task I’ve had good results with, particularly Mistral models. Which makes sense as it’s basically just “repeat this series of tokens with modifications”. The closed models are practically useless from an empirical standpoint as you have no idea if the model you use Monday is the same as Tuesday. “Open” models at least negate this issue. Likewise, I’ve found LLM code to be of poor quality. I think that has to do with being a very experienced and skilled programmer. What the LLM produces is at best the top answer in stack overflow-level skill. The top answers on stack overflow are typically not optimal solutions, they are solutions upvoted by novices. I find LLM code is not only bad, but when I point this out the LLM then “apologizes” and gives better code. My worry is inexperienced people can’t even spot that and won’t get the best answer. In fact, try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?” | |
| ▲ | blharr 4 days ago | parent | next [-] | | There have even been times when an LLM will spit out _the exact same code_, and you have to give it the answer or a hint about how to do it better | |
| ▲ | david-gpu 4 days ago | parent [-] | | Yeah. I had the same experience doing code reviews at work. Sometimes people just get stuck on a problem and can't think of alternative approaches until you give them a good hint. |
| |
| ▲ | david-gpu 4 days ago | parent | prev | next [-] | | > I’ve found LLM code to be of poor quality Yes. That was my experience with most human-produced code I ran into professionally, too. > In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?” Yes, that sometimes works with humans as well, although you usually need to provide more specific feedback to nudge them onto the right track. It gets tiring after a while, doesn't it? | |
| ▲ | dartos 3 days ago | parent [-] | | What is the point of your argument? I keep seeing people say “yeah well I’ve seen humans that can’t do that either.” What’s the point you’re trying to make? | | |
| ▲ | david-gpu 3 days ago | parent [-] | | The point is that the person I responded to criticized LLMs for making the exact sort of mistakes that professional programmers make all the time: > I’ve found LLM code to be of poor quality. I think that has to do with my being a very experienced and skilled programmer. What the LLMs produce is at best “top answer on Stack Overflow”-level skill. The top answers on Stack Overflow are typically not optimal solutions Most professional developers are unable to produce code up to the standard of "the top answer on Stack Overflow" that the commenter was complaining about, with the additional twist that most developers' breadth of knowledge is going to be limited to a very narrow range of APIs/platforms/etc., whereas these LLMs are comparable to decent programmers across just about any API/language/platform, all at once. I've written code for thirty years and I wish I had the breadth and depth of knowledge of the free version of ChatGPT, even if I can outsmart it in narrow domains. It is already very decent and I haven't even tried more advanced models like o1-preview. Is it perfect? No. But it is arguably better than most programmers in at least some aspects. Not every programmer out there is Fabrice Bellard. | |
| ▲ | dartos 3 days ago | parent [-] | | But LLMs aren’t people. And people do more than just generate code. The comparison is weird and dehumanizing. I, personally, have never worked with someone who consistently puts out code that is as bad as LLM-generated code either. > Most professional developers are unable to produce code up to the standard of "the top answer on Stack Overflow" How could you possibly know that? All these types of arguments come from a belief that your fellow human is effectively useless. It’s sad and weird. | |
| ▲ | david-gpu 3 days ago | parent [-] | | >> > Most professional developers are unable to produce code up to the standard of "the top answer on Stack Overflow" > How could you possibly know that? I worked at four multinationals and saw a bunch of their code. Most of it wasn't "the top answer on Stack Overflow". Was some of the code written by some of the people better than that? Sure. And a lot of it wasn't, in my opinion. > All these types of arguments come from a belief that your fellow human is effectively useless. Not at all. I think the top answers on Stack Overflow were written by humans, after all. > It’s sad and weird. You are entitled to your own opinion, no doubt about it. |
|
|
|
| |
| ▲ | Sharlin 4 days ago | parent | prev [-] | | > In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?” These are called "code reviews" and we do that amongst human coders too, although they tend to be less Socratic in nature. I think it has been clear from day one that LLMs don't display superhuman capabilities, and a human expert will always outdo one in tasks related to their particular field. But the breadth of their knowledge is unparalleled. They're the ultimate jacks-of-all-trades, and the astonishing thing is that they're even "average Joe" good at a vast number of tasks, never mind "fresh college graduate" good. The real question has been: what happens when you scale them up? As of now it appears that they scale decidedly sublinearly, but it was not clear at all two or three years ago, and it was definitely worth a try. |
| |
| ▲ | vidarh 4 days ago | parent | prev [-] | | I do contract work in the LLM space which involves me seeing a lot of human prompts, and it's made the magic of human reasoning fall away: humans are shockingly bad at reasoning in the large. One of the things I find extremely frustrating is that almost no research on LLM reasoning ability benchmarks them against average humans. Large proportions of humans struggle to comprehend even a moderately complex sentence with any level of precision. | |
| ▲ | meroes 3 days ago | parent | next [-] | | Aren’t prompts seeking to offload reasoning though? Is that really a fair data point for this? | | |
| ▲ | vidarh 3 days ago | parent [-] | | When people are claiming they can't reason, then yes, benchmarking against an average human should be a bare minimum. Arguably they should benchmark against below-average humans too, because the bar where we'd be willing to argue that a human can't reason is very low. If you're testing to see whether it can replace certain types of work, then it depends on where you would normally set the bar for that type of work. You could offload a whole lot of work with something that reliably reasons below the level of an average human. |
| |
| ▲ | dartos 3 days ago | parent | prev [-] | | Another one! What’s the point of your argument? AI companies: “There’s a new machine that can do reasoning!!!” Some people: “actually they’re not very good at reasoning” Some people like you: “well neither are humans so…” > research on LLM reasoning ability benchmarks them against average humans Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies. > Large proportions of humans struggle to comprehend even a moderately complex sentence with any level of precision. So what? How does that assumption make LLMs better? | | |
| ▲ | vidarh 3 days ago | parent [-] | | The point of my argument is that the vast majority of tasks we carry out do not require good reasoning, because if they did most humans would be incapable of handling them. The point is also that a whole lot of people claim LLMs can't reason, based on setting the bar at a point where a large portion of humanity wouldn't clear it. If you actually benchmarked against average humans, a whole lot of the arguments against reasoning in LLMs would instantly look extremely unreasonable, and borderline offensive. > Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies. They're currently regularly being benchmarked against expectations most humans can't meet. It'd make the models look a whole lot better. |
|
|
|
|
|
|
|
| ▲ | Scarblac 4 days ago | parent | prev | next [-] |
| This is the argument that submarines don't really "swim" as commonly understood, isn't it? |
| |
| ▲ | saithound 4 days ago | parent | next [-] | | I think so, but the badness of that argument is context-dependent. How about the hypothetical context where 70k+ startups are promising investors that they'll win the 50 meter freestyle in 2028 by entering a fine-tuned USS Los Angeles? | |
| ▲ | Jensson 4 days ago | parent | prev [-] | | And planes don't fly like birds: they have very different properties, and many things birds can do can't be done by a plane. What they do is totally different. |
|
|
| ▲ | DiogenesKynikos 4 days ago | parent | prev | next [-] |
| Effective next-token prediction requires reasoning. You can also say humans are "just XYZ biological system," but that doesn't mean they don't reason. The same goes for LLMs. |
| |
| ▲ | griomnib 4 days ago | parent [-] | | Take a word problem for example. A child will be told the first step is to translate the problem from human language to mathematical notation (symbolic representation), then solve the math (logic). A human doesn’t use next token prediction to solve word problems. | | |
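The "translate, then solve" approach described above can be made concrete with a computer algebra system. The sketch below is a minimal illustration, assuming the sympy library and a toy word problem ("Alice has three times as many apples as Bob; together they have 24"); the problem and variable names are made up, not taken from the discussion.

    # Symbolic route to a word problem: translate to equations, then solve.
    # Assumes sympy; the word problem and variable names are illustrative.
    from sympy import Eq, solve, symbols

    # "Alice has three times as many apples as Bob; together they have 24."
    alice, bob = symbols("alice bob")
    equations = [Eq(alice, 3 * bob), Eq(alice + bob, 24)]

    print(solve(equations, [alice, bob]))  # {alice: 18, bob: 6}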
| ▲ | Majromax 4 days ago | parent | next [-] | | But the LLM isn't "using next-token prediction" to solve the problem; that's only how it's evaluated. The "real processing" happens through the various transformer layers (and token-wise nonlinear networks), where it seems as if progressively richer meanings are added to each token. That rich feature set then decodes to the next predicted token, but that decoding step throws away a lot of the information contained in the latent space. If language models (per Anthropic's work) can have a direction in latent space correspond to the concept of the Golden Gate Bridge, then I think it's reasonable (albeit far from certain) to say that LLMs are performing some kind of symbolic-ish reasoning. | |
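A rough way to see this concretely is to inspect an open model's per-layer hidden states and compare them with the single token id that greedy decoding keeps at each step. The sketch below assumes the Hugging Face transformers library and the small "gpt2" checkpoint purely as an example; it illustrates the general point rather than any specific model discussed here.

    # Per-layer latent representations vs. the one token id that survives the
    # decoding step. Assumes the Hugging Face transformers library and the
    # small "gpt2" checkpoint, purely as an example.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    input_ids = tokenizer("The Golden Gate Bridge is", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)

    # One tensor per layer (plus the embedding layer), each of shape
    # (batch, sequence_length, hidden_size): the progressively richer
    # representations of each token.
    print(len(out.hidden_states), out.hidden_states[-1].shape)

    # Greedy decoding collapses all of that: only the last position's logits
    # are used, and they reduce to a single token id fed back in the next step.
    next_id = out.logits[:, -1, :].argmax(dim=-1)
    print(tokenizer.decode(next_id))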
| ▲ | griomnib 4 days ago | parent | next [-] | | Anthropic has a vested interest in people thinking Claude is reasoning. However, in coding tasks I’ve been able to find it directly regurgitating Stack Overflow answers (literally, a Google search turns up the same code). Given that coding is supposed to be Claude’s strength and it’s clearly just parroting web data, I’m not seeing any sort of “reasoning”. LLMs may be useful, but they don’t think. They’ve already plateaued, and given the absurd energy requirements I think they will prove to be far less impactful than people think. | |
| ▲ | DiogenesKynikos 4 days ago | parent [-] | | The claim that Claude is just regurgitating answers from Stack Overflow is not tenable if you've spent time interacting with it. You can give Claude a complex, novel problem, and it will give you a reasonable solution, which it will be able to explain to you and discuss with you. You're getting hung up on the fact that LLMs are trained on next-token prediction. I could equally dismiss human intelligence: "The human brain is just a biological neural network that is adapted to maximize the chance of creating successful offspring." Sure, but the way it solves that task is clearly intelligent. | |
| ▲ | griomnib 4 days ago | parent [-] | | I’ve literally spent 100s of hours with it. I’m mystified why so many people use the “you’re holding it wrong” explanation when somebody points out real limitations. | | |
| ▲ | int_19h 3 days ago | parent | next [-] | | You might consider that other people have also spent hundreds of hours with it, and have seen it correctly solve tasks that cannot be explained by regurgitating something from the training set. I'm not saying that your observations aren't correct, but this is not a binary. It is entirely possible that the tasks you observe the models on are exactly the kind where they tend to regurgitate. But that doesn't mean that it is all they can do. Ultimately, the question is whether there is a "there" there at all. Even if 9 times out of 10, the model regurgitates, but that one other time it can actually reason, that means that it is capable of reasoning in principle. | |
| ▲ | vidarh 4 days ago | parent | prev | next [-] | | When we've spent time with it and gotten novel code, then if you claim that doesn't happen, it is natural to say "you're holding it wrong". If you're just arguing it doesn't happen often enough to be useful to you, that likely depends on your expectations and how complex tasks you need it to carry out to be useful. | | | |
| ▲ | gonab 4 days ago | parent | prev [-] | | In many ways, Claude feels like a miracle to me. I no longer have to stress over semantics, or hunt for patterns I can recognize and work with but have never actually coded myself in that language. Now I don’t have to waste energy looking up things that I find boring. |
|
|
| |
| ▲ | vrighter 3 days ago | parent | prev [-] | | The LLM isn't solving the problem. The LLM is just predicting the next word. It's not "using next-token prediction to solve a problem". It has no concept of "problem". All it can do is predict 1 (one) token that follows a provided set of tokens. That running this in a loop provides you with bullshit (with bullshit defined here as things someone or something says neither with good nor bad intent, but simply with complete disregard for factual accuracy or the lack thereof, so that the information is unreliable for everyone) does not mean it is thinking. | |
| ▲ | DiogenesKynikos 3 days ago | parent | next [-] | | All the human brain does is determine how to fire some motor neurons. No, it does not reason. No, the human brain does not "understand" language. It just knows how to control the firing of neurons that control the vocal cords, in order to maximize an endocrine reward function that has evolved to maximize biological fitness. I can speak about human brains the same way you speak about LLMs. I'm sure you can spot the problem in my conclusions: even though the human brain is "only" firing neurons, it does actually develop an understanding of the world. The same goes for LLMs and next-word prediction. | |
| ▲ | quacker 3 days ago | parent | prev | next [-] | | I agree with you as far as the current state of LLMs goes, but I also feel like we humans have preconceived notions of “thought” and “reasoning”, and are a bit prideful of them. We see the LLM sometimes do sort of well at a whole bunch of tasks. But it makes silly mistakes that seem obvious to us. We say, “Ah ha! So it can’t reason after all”. Say LLMs get a bit better, to the point they can beat chess grandmasters 55% of the time. This is quite good. Low-level chess players rarely ever beat grandmasters, after all. But the LLM spits out illegal moves sometimes and sometimes blunders nonsensically. So we say, “Ah ha! So it can’t reason after all”. But what would it matter if it can reason? Beating grandmasters 55% of the time would make it among the best chess players in the world. For now, LLMs just aren’t that good. They are too error-prone, inconsistent, and nonsensical. But they are also sort of weirdly capable at lots of things in strange, inconsistent ways, and assuming they continue to improve, I think they will tend to defy our typical notions of human intelligence. | |
| ▲ | mhh__ 3 days ago | parent | prev [-] | | I don't see why this isn't a good model for how human reasoning happens either, certainly as a first-order assumption (at least). |
|
| |
| ▲ | TeMPOraL 4 days ago | parent | prev | next [-] | | > A human doesn’t use next token prediction to solve word problems. Of course they do, unless they're particularly conscientious noobs that are able to repeatedly execute the "translate to mathematical notation, then solve the math" algorithm, without going insane. But those people are the exception. Everyone else either gets bored half-way through reading the problem, or has already done dozens of similar problems before, or both - and jump straight to "next token prediction", aka. searching the problem space "by feels", and checking candidate solutions to sub-problems on the fly. This kind of methodical approach you mention? We leave that to symbolic math software. The "next token prediction" approach is something we call "experience"/"expertise" and a source of the thing we call "insight". | | |
| ▲ | vidarh 4 days ago | parent [-] | | Indeed. Work on any project that requires humans to carry out largely repetitive steps, and a large part of the problem involves how to put processes around people to work around humans "shutting off" reasoning and going full-on automatic. E.g. I do contract work on an LLM-related project where one of the systemic changes introduced - in addition to multiple levels of quality checks - is to force people to input a given sentence word for word, followed by a word from a set of 5 or so, and only a minority of the submissions get that sentence correct including the final word, despite the system refusing to let you submit unless the initial sentence is correct. Seeing the data has been an absolutely shocking indictment of human reasoning. These are submissions from a pool of people who have passed reasoning tests... When I've tested the process myself as well, it takes only a handful of steps before the tendency is to "drift off" and start replacing a word here and there and fail to complete even the initial sentence without a correction. I shudder to think how bad the results would be if there wasn't that "jolt" to try to get people back to paying attention. Keeping humans consistently carrying out a learned process is incredibly hard. |
| |
| ▲ | fragmede 4 days ago | parent | prev [-] | | Is that based on a rigorous understanding of how humans think, derived from watching people (children) learn to solve word problems? How do thoughts get formed? Because I remember being given word problems with extra information, and some children trying to shove that information into a math equation despite it not being relevant. The "think things through" portion of ChatGPT o1-preview is hidden from us, so even though o1-preview can solve word problems, we don't know how it internally computes to arrive at that answer. But do we really know how we do it? We can't even explain consciousness in the first place. |
|
|
|
| ▲ | olalonde 3 days ago | parent | prev | next [-] |
| This argument reminds me of the classic "intelligent design" critique of evolution: "Evolution can't possibly create an eye; it only works by selecting random mutations." Personally, I don't see why a "next token predictor" couldn't develop the capability to reason and form abstractions. |
|
| ▲ | 3 days ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | nuancebydefault 4 days ago | parent | prev [-] |
| After reading the article, I am more convinced that it does reason. The base model's reasoning capabilities are partly hidden by the chatty derived model's logic. |