Remix.run Logo
trashtester 2 days ago

> Sure, but talking about Turing completeness is not about efficiency, but about the computational ability of a system.

I know. That is part of my claim that this "talking about Turing completeness" is a distraction. Specifically because it ignores efficiency/speed.

> Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial.

And again, I KNOW that it's utterly trivial to create a Turing Complete system. I ALSO know that a Turing Complete system can perform ANY computation (it pretty much defines what a computation is), given enough time and memory/storage.

But if such a calculation takes 10^6 times longer than necessary, it's also utterly useless to approach it in this way.

Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step.

> The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port.

This model is intrinsically single threaded, so the global branches/dependencies requirement is trivially satisifed.

Generally, though, if you want to be able to distribute a computation, you have to pretty much never allow the results of a computation of any arbitrary compute thread to affect the next computation on any of the other threads.

NOBODY would be able to train LLM's that are anything like the ones we see today, if they were not willing to make that sacrifice.

Also, downstream from this is the hardware optimizations that are needed to even run these algorithms. While you _could_ train any of the current LLM's on large CPU clusters, a direct port would require perhaps 1000x more hardware, electricity, etc than running it on GPU's or TPU's.

Not only that, but if the networks being trained (+ some amount of training data) couldn't fit into the fast GPU/TPU memory during training, but instead had to be swapped in and out of system memory or disk, then that would also cause orders of magnitude of slowdown, even if using GPU/TPU's for the training.

In other words, what we're seeing is a trend towards ever increasing coupling between algorithms being run and the hardware they run on.

When I say that thinking in terms of Turing Completeness is a distraction, it doesn't mean it's wrong.

It's just irrelevant.

vidarh a day ago | parent [-]

> NOBODY would be able to train LLM's that are anything like the ones we see today, if they were not willing to make that sacrifice.

Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice, is the point. Because Turing completeness does not mean all, or most, or even any of your computations need to be carried out like in a UTM. It only means it needs the theoretical ability. They can take any set of shortcuts you want.

trashtester a day ago | parent | next [-]

> Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice, is the point.

I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant.

> It only means it needs the theoretical ability.

This is absolutely incorporated in my previous post. Which is why I wrote:

>> Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step.

> It only means it needs the theoretical ability. They can take any set of shortcuts you want.

I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts.

Only by staing within the subset of all possible algorithms that CAN be properly paralellized (and have the proper hardware to run it) can you perform the number of calculations needed to train something like an LLM.

> Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice,

Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM.

On the other hand, if you make a cluster to train LLM's that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single cuda run activated, for instance), it can still be just as good at it's dedicated task (training LLM)s.

Another example would be the brain of a new-born baby. I'm pretty sure such a brain is NOT Turing Complete in any way. It has a very short list of training algorithms that are constanly running as it's developing.

But it can't even run Hello World.

For it to really be Turing Complete, it needs to be able to follow instructions accurately (no halucinations, etc) and also access to infinite storage/tape (or it will be a Finite State Machine). Again, it still doesn't matter if it's Turing Complete in this context.

vidarh 15 hours ago | parent [-]

> I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant.

Why do you think it is irrelevant? It is what allows us to say with near certainty that dismissing the potential of LLMs to be made to reason is unscientific and irrational.

> I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts.

But again, we've not sacrificed the ability to run those.

> Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM.

Maybe or maybe not, because today inference is expensive, but we already are running plenty of algorithms that require many runs, and steadily increasing as inference speed relative to network size is improving.

> On the other hand, if you make a cluster to train LLM's that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single cuda run activated, for instance), it can still be just as good at it's dedicated task (training LLM)s.

And? The specific code used to run training has no relevance to the limitations of the model architecture.

trashtester 15 hours ago | parent [-]

See my other response above, I think I've identified what part of my argument was unclear.

The update may still have claims in it that you disagree with, but those are specific and (at some point in the future) probably testable.

trashtester 15 hours ago | parent | prev [-]

First I would like to thank you for bein patient with me. After some contemplation, I think I've identified what aspect of my argument hasn't been properly justified, which causes this kind of discussion.

Let's first define C as the set of all algorithms that are computable by any Turing complete system.

The main attractive feature of Turing Completeness is specifically this universality. You can take an algorithm running on one Turing Complete system and port it to another, with some amount of translation work (often just a compiler)

Now let's define the subset of all algorithms C that we are not to properly parallelize, let's label it U. (U is a subset of C).

The complementary subset of C, that CAN be parallelized properly we label P (P is also a subset of C).

Now define algorithms that require a lot of compute (>= 10^20 steps or so) as L. L is also a subset of C.

The complementary ("small" computations) can be labelled S (< 10^20 steps, though the exact cutoff is a bit arbitrary).

Now we define the intersections of S, L, U, P:

(Edit: changed union to intersect)

S_P (intersect of S and P)

L_P (intersect of L and P)

S_U (intersect of S and U)

L_U (intersect of L and U)

For S_P and S_U, the advantages of Turing Completeness remains. L_U is going to be hard to compute on any Turing Complete system.

(Edit: The mistake in my earlier argument was to focus on L_U. L_U is irrelevant for the relevance of the universality of Turing Complete systems, since no such system can run such calculations in a reasonable amount of time, anyway. To run algorithms in the L_U domain would require either some fundamental breakthrough in "single threaded" performance, Quantum Computing or some kind of magic/soul, etc)

This leaves us with L_P. There are computations/algorithms that CAN be parallelized, at least in principle. I will only discuss these from here on.

My fundamental claim is as follows:

While algorithms/computations that belong to the L_P set ARE in theory computable on any Turing Complete system, the variation in how long it takes to compute them can vary so much between different Turing Complete systems that this "universality" stops having practical relevance

For instance, let's say two computers K1 and K2 can run one such algorithm (lp_0, for instance) at the same speed. But on other algorithms (lp_1 and lp_2) the difference between how fast those system can run the computation can vary by a large factor (for instance 10^6), often in both directions.

Let's say lp_1 is 10^6 times faster on K1 than on K2, while lp_2 is 10^6 faster on K2 than on K1.

(Edit: note that lp_2 will take (10^6)^2 = 10^12 times longer than lp_1 on K1)

While both these algorithms are in theory computable on both K1 and K2, this is now of very little practical importance. You always want to run lp_1 on K1 and lp_2 on K2.

Note that I never say that either K1 or K2 are (Edit) not Turing complete. But the main attraction of Turing Completeness is now of no practical utility, namely the ability to move an algorithm from one Turing Complete system to another.

Which also means that what you really care about is not Turing Completeness at all. You care about the ability to calculate lp_1 and lp_2 within a reasonable timeframe, days to years, not decades to beyond eons.

And this is why I'm claiming that this is a paradigm shift. The Turing Completeness ideas were never wrong, they just stopped being useful in the kind of computations that are getting most of the attention now.

Instead, we're moving into a reality where computers are to an ever greater degree specialized for a single purpose, while the role of general purpose computers is fading.

And THIS is where I think my criticism of Deutsch is still accurate. Or rather, if we assume that the human brain belongs to the L_P set and strongly depend on it's hardware for doing what it is doing, this creates a strong constraint on the types of hardware that the human mind can conceivably be uploaded to.

And vice versa. While Deutsch tends to claim that humans will be able to run any computation that ASI will run in the future, I would claim that to the extent a human is able to act as Turing Complete, such computations may take more than a lifetime and possibly longer than the time until the heat death of the universe.

And I think where Deutsch goes wrong, is that he thinks that our conscious thoughts are where our real cognition is going on. My intuition is that while our internal monologue operates at around 1-100 operations per second, our subconscious is requiring in the range of gigaflops to exaflops for our minds to be able to operate in real time.

vidarh 13 hours ago | parent [-]

So, the reason I think the argument on Turing completeness matters here, is that if we accept that an LLM and a brain are both Turing complete, then while you're right that there can be Turing complete systems that are so different in performance characteristics that some are entirely impractical (Turings original UTM is an example of a Turing complete system that is too slow for practical use), if they are both Turing complete the brain is then both an existence-proof for the ability of Turing machines to be made to reason and gives us an upper limit in terms of volume and power needed to achieve human-level reasoning.

It may take us a long time to get there (it's possible we never will), and it may take significant architectural improvements (so it's not a given current LLM architectures can compete on performance), but if both are Turing complete (and not more) then dismissing the human ability to do so is

It means those who dismiss LLM's as "just" next token predictors assuming that this says something about the possibility of reasoning don't have a leg to stand on. And this is why the Turing completeness argument matters to me. I regularly get in heated arguments with people who get very upset at the notion that LLMs can possibly ever reason - this isn't a hypothetical.

> And vice versa. While Deutsch tends to claim that humans will be able to run any computation that ASI will run in the future, I would claim that to the extent a human is able to act as Turing Complete, such computations may take more than a lifetime and possibly longer than the time until the heat death of the universe.

If you mean "act as" in the sense of following operations of a Turing-style UTM with tape, then sure, that will be impractically slow for pretty much everything. Our ability to do so (and the total lack of evidence that we can do anything which exceeds Turing completeness) just serves as a simple way of proving we are Turing complete. In practice, we do most things in far faster ways than simulating a UTM. But I agree with you that I don't think humans can compete with computers in the long run irrespective of the type of problem.

trashtester 11 hours ago | parent [-]

Ok, so it looks like you think you've been arguing against someone who doubt that LLM's (and similar NN's) cannot match the capabilities of humans. In fact, I'm probably on the other side from you compared to them.

Now let's first look at how LLM's operate in practice:

Current LLM's will generally run on some compute cluster, often with some VM layer (and sometimes maybe barebone), followed by an OS on each node, and then Torch/TensorFlow etc to synchronize them.

It doesn't affect the general argument if we treat the whole inference system (the training system is similar) as one large Turing Complete system.

Since the LLM's have from billions to trillions of weights, I'm going to assume that for each token produced by the LLM it will perform 10^12 FP calculations.

Now, let's assume we want to run the LLM itself as a Turing Machine. Kind of like a virtual machine INSIDE the compute cluster. A single floating point multiplication may require in the order of 10^3 tokens.

In other words, by putting 10^15 floating point operations in, we can get 1 floating point operation out.

Now this LLM COULD run any other LLM inside it (if we chose to ignore memory limitations). But it would take at minimum in the order of 10^15 times longer to run than the first LLM.

My model of the brain is similar. We have a substrate (the physical brain) that runs a lot of computation, one tiny part of that is the ability that trained adults can get to perform any calculation (making us Turing Complete).

But compared to the number of raw calculations required by the substrate, our capability to perform universal computation is maybe 1 : 10^15, like the LLM above.

Now, I COULD be wrong in this. Maybe there is some way for LLM's to achieve full access to the underlying hardware for generic computation (if only the kinds of computations other computers can perform). But it doesn't seem that way for me, neither for current generation LLM's nor human brains.

Also, I don't think it matters. Why would we build an LLM to do the calculations when it's much more efficient to build hardware specifically to perform such computations, without the hassle of running it inside an LLM?

The exact computer that we run the LLM (above) on would be able to load other LLM's directly instead of using an intermediary LLM as a VM, right?

It's still not clear to me where this is not obvious....

My speculation, though, is that there is an element of sunk cost fallacy involved. Specifically for people my (and I believe) your age that had a lot of our ideas about these topics formed in the 90s and maybe 80s/70s.

Go back 25+ years, and I would agree to almost everything you write. At the time computers mostly did single threaded processing, and naïve extrapolation might indicate that the computers of 2030-2040 would reach human level computation ability in a single thread.

In such a paradigm, every computer of approximately comparable total power would be able to run the same algorithms.

But that stopped being the case around 10 years ago, and the trend seems to continue to be in the direction of purpose-specific hardware taking over from general purpose machines.

Edit: To be specific, the sunk cost fallacy enters here because people have been having a lot of clever ideas that depend on the principle of Turing Completeness, like the ability to easily upload minds to computers, or to think of our mind as a barebone computer (not like an LLM, but more like KVM or a Blank Slate), where we can plug in any kind of culture, memes, etc.