Remix.run Logo
vidarh 3 days ago

> My point is that any such system is extremely limited due to how slow they become at scale (when running algorithms/programs that require full turing completeness), due to it's "single threaded" nature. Such algorithms simply are not very parallelizable.

My point is that this isn't true. Every computer you've ever used is a Turing complete system, and there is no need for such a system to be single-threaded, as a multi-threaded system can simulate a single-threaded system and vice versa.

> I know that a GPU's CAN in principle emulate a Turing Complete system, but they're even slower at it than CPU's, so that's irrelevant. The same goes for human brains.

Any system that can emulate any other Turing complete system is Turing complete, so they are Turing complete.

You seem to confuse Turing completeness with a specific way of executing something. Turing completeness is about the theoretical limits on which set of functions a system can execute, not how they execute them.

trashtester 2 days ago | parent [-]

> My point is that this isn't true. Every computer you've ever used is a Turing complete system, and there is no need for such a system to be single-threaded, as a multi-threaded system can simulate a single-threaded system and vice versa.

Not all algorithms can be distributed effectively across multiple threads.

A computer can have 1000 cpu cores, and only be able to use a single one when running such algorithms.

Some other algorithms may be distributed through branch predictions, by trying to run future processing steps ahead of time for each possible outcome of the current processing step. In fact, modern CPU's already do this a lot to speed up processing.

But even branch prediction hits something like a logarithmic wall of diminishing returns.

While you are right that multi core CPU's (or whole data centers) can run such algorithms, that doesn't mean they can run them quickly, hence my claim:

>> My point is that any such system is extremely limited due to how slow

Algorithms that can only utilize a single core seem to be stuck at the GFLOPS scale, regardless of what hardware they run on.

Even if only a small percentage (like 5%) of the code in a program is inherently limited to being single threaded (At best, you will achieve TFlops numbers), this imposes a fatal limitation on computational problems that require very large amounts of computing power. (For instance at the ExaFlop scale or higher.)

THIS is the flaw of the Turing Completness idea. Algorithms that REQUIRE the full feature set of Turing Completeness are in some cases extremely slow.

So if you want to do calculations that require, say, 1 ExaFlop (about the raw compute of the human brain) to be fast enough for a given purpose, you need to make almost all compute steps fully parallelizable.

Now that you've restricted your algorithm to no longer require all features of a Turing Complete system, you CAN still run it on Turing Complete CPU's. You're just not MAKING USE of their Turing Completeness. That's just very expensive.

At this point, you may as well build dedicated hardware that do not have all the optimizations that CPU have for single threaded computation, like GPU's or TPU's, and lower your compute cost by 10x, 100x or more (which could be the difference between $500 billion and $billion).

At this point, you've left the Turing Completness paradigm fully behind. Though the real shift happened when you removed those requirements from your algorithm, not when you shifted the hardware.

One way to describe this, is that from the space of all possible algorithms that can run on a Turing Complete system, you've selected a small sub-space of algorithms that can be parallelized.

By doing this trade, you've severely limited what algorithms you can run, in exchange for the ability to get a speedup of 6 orders of magnitude, or more in many cases.

And in order to keep this speedup, you also have to accept other hardware based limitations, such as staying within the amount of GPU memory available, etc.

Sure, you COULD train GPT-5 or Grok-3 on a C-64 with infinite casette tape storage, but it would take virtually forever for the training to finish. So that fact has no practical utility.

I DO realize that the concept of the equivalence of all Turing Complete systems is very beautiful. But this CAN be a distraction, and lead to intuitions that seem to me to be completely wrong.

Like Deutsch's idea that the ideas in a human brain are fundamentally linked to the brain's Turing Completeness. While in reality, it takes years of practice for a child to learn how to be Turing Complete, and even then the child's brain will struggle to do a floating point calculation every 5 minutes.

Meanwhile, joint systems of algorithms and the hardware they run on can do very impressive calculations when ditching some of the requirements of Turing Completeness.

vidarh 2 days ago | parent [-]

> Not all algorithms can be distributed effectively across multiple threads.

Sure, but talking about Turing completeness is not about efficiency, but about the computational ability of a system.

> THIS is the flaw of the Turing Completness idea. Algorithms that REQUIRE the full feature set of Turing Completeness are in some cases extremely slow.

The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port.

It's not about whether the algorithms require a Turing complete system, but that Turing completeness proves the equivalence of the upper limit of which set of functions the architecture can compute, and that pretty much any meaningful architecture you will come up with is still Turing complete.

> At this point, you've left the Turing Completness paradigm fully behind. Though the real shift happened when you removed those requirements from your algorithm, not when you shifted the hardware.

If a system can take 3 bits of input and use it to look up 5 bits of output in a table of 30 bits of data, and it is driven by a driver that uses 1 bit of the input as the current/new state, 1 bit for the direction to move the tape, and 3 bits for the symbol to read/write, and that driver processes the left/right/read/write tape operations and loops back, you have Turing complete system (Wolfram's 2-state 3-symbol Turing machine).

So no, you have not left Turing completeness behind, as any function that can map 3 bits of input to 5 bits of output becomes a Turing complete system if you can put a loop and IO mechanism around it.

Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial.

trashtester 2 days ago | parent [-]

> Sure, but talking about Turing completeness is not about efficiency, but about the computational ability of a system.

I know. That is part of my claim that this "talking about Turing completeness" is a distraction. Specifically because it ignores efficiency/speed.

> Again, the point is not that this is a good way of doing something, but that it serves as a way to point out that what it takes to make an LLM Turing complete is so utterly trivial.

And again, I KNOW that it's utterly trivial to create a Turing Complete system. I ALSO know that a Turing Complete system can perform ANY computation (it pretty much defines what a computation is), given enough time and memory/storage.

But if such a calculation takes 10^6 times longer than necessary, it's also utterly useless to approach it in this way.

Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step.

> The "feature set" of Turing Completeness can be reduced to a loop, an array lookup, and an IO port.

This model is intrinsically single threaded, so the global branches/dependencies requirement is trivially satisifed.

Generally, though, if you want to be able to distribute a computation, you have to pretty much never allow the results of a computation of any arbitrary compute thread to affect the next computation on any of the other threads.

NOBODY would be able to train LLM's that are anything like the ones we see today, if they were not willing to make that sacrifice.

Also, downstream from this is the hardware optimizations that are needed to even run these algorithms. While you _could_ train any of the current LLM's on large CPU clusters, a direct port would require perhaps 1000x more hardware, electricity, etc than running it on GPU's or TPU's.

Not only that, but if the networks being trained (+ some amount of training data) couldn't fit into the fast GPU/TPU memory during training, but instead had to be swapped in and out of system memory or disk, then that would also cause orders of magnitude of slowdown, even if using GPU/TPU's for the training.

In other words, what we're seeing is a trend towards ever increasing coupling between algorithms being run and the hardware they run on.

When I say that thinking in terms of Turing Completeness is a distraction, it doesn't mean it's wrong.

It's just irrelevant.

vidarh a day ago | parent [-]

> NOBODY would be able to train LLM's that are anything like the ones we see today, if they were not willing to make that sacrifice.

Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice, is the point. Because Turing completeness does not mean all, or most, or even any of your computations need to be carried out like in a UTM. It only means it needs the theoretical ability. They can take any set of shortcuts you want.

trashtester a day ago | parent | next [-]

> Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice, is the point.

I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant.

> It only means it needs the theoretical ability.

This is absolutely incorporated in my previous post. Which is why I wrote:

>> Specifically, the problem with Turing Completeness is that it implies the ability to create global branches/dependencies in the code based on the output of any previous computation step.

> It only means it needs the theoretical ability. They can take any set of shortcuts you want.

I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts.

Only by staing within the subset of all possible algorithms that CAN be properly paralellized (and have the proper hardware to run it) can you perform the number of calculations needed to train something like an LLM.

> Every LLM we have today is Turing complete if you put a loop around it that uses context as a means to continue the state transitions so they haven't made that sacrifice,

Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM.

On the other hand, if you make a cluster to train LLM's that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single cuda run activated, for instance), it can still be just as good at it's dedicated task (training LLM)s.

Another example would be the brain of a new-born baby. I'm pretty sure such a brain is NOT Turing Complete in any way. It has a very short list of training algorithms that are constanly running as it's developing.

But it can't even run Hello World.

For it to really be Turing Complete, it needs to be able to follow instructions accurately (no halucinations, etc) and also access to infinite storage/tape (or it will be a Finite State Machine). Again, it still doesn't matter if it's Turing Complete in this context.

vidarh 16 hours ago | parent [-]

> I don't think you understood what I was writing. I wasn't saying that either the LLM (finished product OR the machines used for training them) were not Turing Complete. I said it was irrelevant.

Why do you think it is irrelevant? It is what allows us to say with near certainty that dismissing the potential of LLMs to be made to reason is unscientific and irrational.

> I'm not talking about shortcuts. When I talk about sacrificing, I'm talking about algorithms that you can run on any Turing Complete machine that are (to our knowledge) fundamentally impossible to distribute properly, regardless of shortcuts.

But again, we've not sacrificed the ability to run those.

> Which, to the degree that it's true, is irrelevant for the reason that I'm saying Turing Completeness is a distraction. You're not likely to run algorithms that require 10^20 to 10^25 steps within the context of an LLM.

Maybe or maybe not, because today inference is expensive, but we already are running plenty of algorithms that require many runs, and steadily increasing as inference speed relative to network size is improving.

> On the other hand, if you make a cluster to train LLM's that is explicitly NOT Turing Complete (it can be designed to refuse to run code that is not fully parallel to avoid costs in the millions just to have a single cuda run activated, for instance), it can still be just as good at it's dedicated task (training LLM)s.

And? The specific code used to run training has no relevance to the limitations of the model architecture.

trashtester 15 hours ago | parent [-]

See my other response above, I think I've identified what part of my argument was unclear.

The update may still have claims in it that you disagree with, but those are specific and (at some point in the future) probably testable.

trashtester 15 hours ago | parent | prev [-]

First I would like to thank you for bein patient with me. After some contemplation, I think I've identified what aspect of my argument hasn't been properly justified, which causes this kind of discussion.

Let's first define C as the set of all algorithms that are computable by any Turing complete system.

The main attractive feature of Turing Completeness is specifically this universality. You can take an algorithm running on one Turing Complete system and port it to another, with some amount of translation work (often just a compiler)

Now let's define the subset of all algorithms C that we are not to properly parallelize, let's label it U. (U is a subset of C).

The complementary subset of C, that CAN be parallelized properly we label P (P is also a subset of C).

Now define algorithms that require a lot of compute (>= 10^20 steps or so) as L. L is also a subset of C.

The complementary ("small" computations) can be labelled S (< 10^20 steps, though the exact cutoff is a bit arbitrary).

Now we define the intersections of S, L, U, P:

(Edit: changed union to intersect)

S_P (intersect of S and P)

L_P (intersect of L and P)

S_U (intersect of S and U)

L_U (intersect of L and U)

For S_P and S_U, the advantages of Turing Completeness remains. L_U is going to be hard to compute on any Turing Complete system.

(Edit: The mistake in my earlier argument was to focus on L_U. L_U is irrelevant for the relevance of the universality of Turing Complete systems, since no such system can run such calculations in a reasonable amount of time, anyway. To run algorithms in the L_U domain would require either some fundamental breakthrough in "single threaded" performance, Quantum Computing or some kind of magic/soul, etc)

This leaves us with L_P. There are computations/algorithms that CAN be parallelized, at least in principle. I will only discuss these from here on.

My fundamental claim is as follows:

While algorithms/computations that belong to the L_P set ARE in theory computable on any Turing Complete system, the variation in how long it takes to compute them can vary so much between different Turing Complete systems that this "universality" stops having practical relevance

For instance, let's say two computers K1 and K2 can run one such algorithm (lp_0, for instance) at the same speed. But on other algorithms (lp_1 and lp_2) the difference between how fast those system can run the computation can vary by a large factor (for instance 10^6), often in both directions.

Let's say lp_1 is 10^6 times faster on K1 than on K2, while lp_2 is 10^6 faster on K2 than on K1.

(Edit: note that lp_2 will take (10^6)^2 = 10^12 times longer than lp_1 on K1)

While both these algorithms are in theory computable on both K1 and K2, this is now of very little practical importance. You always want to run lp_1 on K1 and lp_2 on K2.

Note that I never say that either K1 or K2 are (Edit) not Turing complete. But the main attraction of Turing Completeness is now of no practical utility, namely the ability to move an algorithm from one Turing Complete system to another.

Which also means that what you really care about is not Turing Completeness at all. You care about the ability to calculate lp_1 and lp_2 within a reasonable timeframe, days to years, not decades to beyond eons.

And this is why I'm claiming that this is a paradigm shift. The Turing Completeness ideas were never wrong, they just stopped being useful in the kind of computations that are getting most of the attention now.

Instead, we're moving into a reality where computers are to an ever greater degree specialized for a single purpose, while the role of general purpose computers is fading.

And THIS is where I think my criticism of Deutsch is still accurate. Or rather, if we assume that the human brain belongs to the L_P set and strongly depend on it's hardware for doing what it is doing, this creates a strong constraint on the types of hardware that the human mind can conceivably be uploaded to.

And vice versa. While Deutsch tends to claim that humans will be able to run any computation that ASI will run in the future, I would claim that to the extent a human is able to act as Turing Complete, such computations may take more than a lifetime and possibly longer than the time until the heat death of the universe.

And I think where Deutsch goes wrong, is that he thinks that our conscious thoughts are where our real cognition is going on. My intuition is that while our internal monologue operates at around 1-100 operations per second, our subconscious is requiring in the range of gigaflops to exaflops for our minds to be able to operate in real time.

vidarh 13 hours ago | parent [-]

So, the reason I think the argument on Turing completeness matters here, is that if we accept that an LLM and a brain are both Turing complete, then while you're right that there can be Turing complete systems that are so different in performance characteristics that some are entirely impractical (Turings original UTM is an example of a Turing complete system that is too slow for practical use), if they are both Turing complete the brain is then both an existence-proof for the ability of Turing machines to be made to reason and gives us an upper limit in terms of volume and power needed to achieve human-level reasoning.

It may take us a long time to get there (it's possible we never will), and it may take significant architectural improvements (so it's not a given current LLM architectures can compete on performance), but if both are Turing complete (and not more) then dismissing the human ability to do so is

It means those who dismiss LLM's as "just" next token predictors assuming that this says something about the possibility of reasoning don't have a leg to stand on. And this is why the Turing completeness argument matters to me. I regularly get in heated arguments with people who get very upset at the notion that LLMs can possibly ever reason - this isn't a hypothetical.

> And vice versa. While Deutsch tends to claim that humans will be able to run any computation that ASI will run in the future, I would claim that to the extent a human is able to act as Turing Complete, such computations may take more than a lifetime and possibly longer than the time until the heat death of the universe.

If you mean "act as" in the sense of following operations of a Turing-style UTM with tape, then sure, that will be impractically slow for pretty much everything. Our ability to do so (and the total lack of evidence that we can do anything which exceeds Turing completeness) just serves as a simple way of proving we are Turing complete. In practice, we do most things in far faster ways than simulating a UTM. But I agree with you that I don't think humans can compete with computers in the long run irrespective of the type of problem.

trashtester 12 hours ago | parent [-]

Ok, so it looks like you think you've been arguing against someone who doubt that LLM's (and similar NN's) cannot match the capabilities of humans. In fact, I'm probably on the other side from you compared to them.

Now let's first look at how LLM's operate in practice:

Current LLM's will generally run on some compute cluster, often with some VM layer (and sometimes maybe barebone), followed by an OS on each node, and then Torch/TensorFlow etc to synchronize them.

It doesn't affect the general argument if we treat the whole inference system (the training system is similar) as one large Turing Complete system.

Since the LLM's have from billions to trillions of weights, I'm going to assume that for each token produced by the LLM it will perform 10^12 FP calculations.

Now, let's assume we want to run the LLM itself as a Turing Machine. Kind of like a virtual machine INSIDE the compute cluster. A single floating point multiplication may require in the order of 10^3 tokens.

In other words, by putting 10^15 floating point operations in, we can get 1 floating point operation out.

Now this LLM COULD run any other LLM inside it (if we chose to ignore memory limitations). But it would take at minimum in the order of 10^15 times longer to run than the first LLM.

My model of the brain is similar. We have a substrate (the physical brain) that runs a lot of computation, one tiny part of that is the ability that trained adults can get to perform any calculation (making us Turing Complete).

But compared to the number of raw calculations required by the substrate, our capability to perform universal computation is maybe 1 : 10^15, like the LLM above.

Now, I COULD be wrong in this. Maybe there is some way for LLM's to achieve full access to the underlying hardware for generic computation (if only the kinds of computations other computers can perform). But it doesn't seem that way for me, neither for current generation LLM's nor human brains.

Also, I don't think it matters. Why would we build an LLM to do the calculations when it's much more efficient to build hardware specifically to perform such computations, without the hassle of running it inside an LLM?

The exact computer that we run the LLM (above) on would be able to load other LLM's directly instead of using an intermediary LLM as a VM, right?

It's still not clear to me where this is not obvious....

My speculation, though, is that there is an element of sunk cost fallacy involved. Specifically for people my (and I believe) your age that had a lot of our ideas about these topics formed in the 90s and maybe 80s/70s.

Go back 25+ years, and I would agree to almost everything you write. At the time computers mostly did single threaded processing, and naïve extrapolation might indicate that the computers of 2030-2040 would reach human level computation ability in a single thread.

In such a paradigm, every computer of approximately comparable total power would be able to run the same algorithms.

But that stopped being the case around 10 years ago, and the trend seems to continue to be in the direction of purpose-specific hardware taking over from general purpose machines.

Edit: To be specific, the sunk cost fallacy enters here because people have been having a lot of clever ideas that depend on the principle of Turing Completeness, like the ability to easily upload minds to computers, or to think of our mind as a barebone computer (not like an LLM, but more like KVM or a Blank Slate), where we can plug in any kind of culture, memes, etc.