cschep 2 days ago

Since LLMs aren’t deterministic, isn’t it impossible? What would keep it from iterating back and forth between two failing states forever? Is this the halting problem?

daxfohl a day ago | parent | next [-]

I'd suggest the problem isn't that LLMs are nondeterministic. It's that English is.

With a coding language, once you know the rules, there's no two ways to understand the instructions. It does what it says. With English, good luck getting everyone and the LLM to agree on what every word means.

Going with LLM as a compiler, I expect that by the time you get the English precise enough to be "compiled", the document will be many times larger than the resulting code, will no longer be a reasonable requirements doc because it reads like code, and will still be inscrutable to engineers because it's so verbose.

dworks a day ago | parent | next [-]

Sure, we cannot agree on the correct interpretation of the instructions. But we also cannot define what correct output is.

First, the term “accuracy” is somewhat meaningless when it comes to LLMs. Anything that an LLM outputs is by definition “accurate” or “correct” from a technical point of view, because it was produced by the model. The term accuracy, then, is not a technical or even factual term but a sociological and cultural one, where what is right or wrong is determined by society, and even we sometimes have a hard time determining what is true or not (see: philosophy).

miningape a day ago | parent | next [-]

What? What does philosophy have to do with anything?

If you cannot agree on the correct interpretation, nor on the output, what stops an LLM from solving the wrong problem? What stops an LLM from "compiling" the wrong source code? What even makes it possible for us to solve a problem? If I ask an LLM to add a column to a table and it drops the table, that's a critical failure - not something to be reinterpreted as a "new truth".

Philosophical arguments are fine when it comes to loose concepts like human language (interpretive domains). Computer languages, on the other hand, are precise and not open to interpretation (formal domains), so philosophical arguments cannot be applied to them - only to the human interpretation of code.

It's like how mathematical "language" (again a formal domain) describes precise rulesets (axioms), and every "fact" (theorem) is derived from them. You cannot philosophise your way out of the axioms being the base units of expression; you cannot philosophise a theorem into falsehood (instead you must show, through precise mathematical language, why a theorem breaks the axioms). This is exactly why programming, like mathematics, is a domain where correctness is objective and not something that can be waved away with philosophical reinterpretation. (This is also why the philosophy department is kept far away from the mathematics department.)

dworks a day ago | parent [-]

Looks like you misunderstood my comment. My point is that both the input and the output are too fuzzy for an LLM to be reliable in an automated system.

"Truth is one of the central subjects in philosophy." - https://plato.stanford.edu/entries/truth/

miningape a day ago | parent [-]

Ah yes, that makes a lot more sense - I understood your comment as something like "the LLMs are always correct, we just need to redefine how programming languages work"

I think I made it halfway to your _actual_ point and then just missed it entirely.

> If you cannot agree on the correct interpretation, nor output, what stops an LLM from solving the wrong problem?

dworks a day ago | parent [-]

Yep. I'm saying the problem is not just about interpreting and validating the output. You also need to interpret the question, since it's in natural language rather than code, so it's not just twice as hard but strictly impossible to reach 100% accuracy with an LLM, because you can't define what is correct in every case.

codingdave a day ago | parent | prev [-]

It seems to me that we already have enough people using the "truth is subjective" arguments to defend misinformation campaigns. Maybe we don't need to expand it into even more areas. Those philosophical discussions are interesting in a classroom setting, but far less interesting when talking about real-world impact on people and society. Or perhaps "less interesting" is unfair, but when LLMs straight up get facts wrong, that is not the time for philosophical pontification about the nature of accuracy. They are just wrong.

dworks a day ago | parent [-]

I'm not making excuses for LLMs. I'm saying that when you have a non-deterministic system for which you have to evaluate all the output for correctness due to its unpredictability, that is a practically impossible task.

rickydroll a day ago | parent | prev [-]

Yes, in general, English is non-deterministic: consider how a sentence reads with or without an Oxford comma.

When I programmed for a living, I found coding quite tedious and preferred to start with a mix of English and mathematics, describing what I wanted to do, and then translate that text into code. When I discovered Literate Programming, it was significantly closer to my way of thinking. Literate programming was not without its shortcomings and lacked many aspects of programming languages we have come to rely on today.

Today, when I write small to medium-sized programs, it reads mostly like a specification, and it's not much bigger than the code itself. There are instances where I need to write a sentence or brief paragraph to prompt the LLM to generate correct code, but this doesn't significantly disrupt the flow of the document.

However, if this is going to be a practical approach, we will need a deterministic system that can use English and predicate calculus to generate reproducible software.

daxfohl a day ago | parent [-]

Interesting, I'm the opposite! I far prefer to start off with a bit of code to help explore gotchas I might not have thought about and to help solidify my thoughts and approach. It doesn't have to be complete, or even compile. Just enough to identify the tradeoffs of whatever I'm doing.

Once I have that, it's usually far easier to flesh out the details in the detailed design doc, or go back to the Product team and discuss conflicting or vague requirements, or opportunities for tweaks that could lead to more flexibility or whatever else. Then from there it's usually easier to get the rest of the team on the same page, as I feel I'll understand more concretely the tradeoffs that were made in the design and why.

(Not saying one approach is better than the other. I just find the difference interesting).

gloxkiqcza 2 days ago | parent | prev | next [-]

Correct me if I’m wrong, but LLMs are deterministic; the randomness is added intentionally in the pipeline.

mzl a day ago | parent | next [-]

LLMs can be run in a mostly deterministic mode (see https://docs.pytorch.org/docs/stable/notes/randomness.html for some info on running PyTorch programs).

Varying the deployment type (chip model, number of chips, batch size, ...) can also change the output due to rounding errors. See https://arxiv.org/abs/2506.09501 for some details on that.
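
For reference, a minimal sketch of what "mostly deterministic" looks like in PyTorch, based on the settings described in that randomness doc (the exact flags you need depend on your version and backend):

    import os, random
    import numpy as np
    import torch

    # Seed every RNG the stack touches.
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)

    # Fail loudly if a known-nondeterministic kernel would be used.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

    # Needed for deterministic cuBLAS matmuls on recent CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

Even then, results are only reproducible on the same hardware, library versions and batch sizes, which is what the linked paper is about.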

zekica a day ago | parent | prev [-]

The two parts of your statement don't go together. The list of potential output tokens and their probabilities is generated deterministically, but the actual token returned is then chosen at random (weighted based on the "temperature" parameter and the probability values).
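
A rough sketch of that two-step process (names made up for illustration): the scores are a deterministic function of the input, and the randomness only enters at the final draw.

    import numpy as np

    rng = np.random.default_rng()

    def sample_next_token(logits, temperature=1.0):
        # Deterministic part: the model already produced these scores.
        scaled = np.asarray(logits, dtype=np.float64) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        # Nondeterministic part: a weighted dice roll over the vocabulary.
        return int(rng.choice(len(probs), p=probs))

Lower temperatures sharpen the distribution toward the top token, but any temperature above zero still leaves a dice roll.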

galaxyLogic a day ago | parent | next [-]

I assume they use software-based pseudo-random-number generators. Those can typically be given a seed-value which determines (deterministically) the sequence of random numbers that will be generated.

So if an LLM uses a seedable pseudo-random-number-generator for its random numbers, then it can be fully deterministic.
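
For example, with Python's standard generator (just to illustrate the seeding point, not how any particular LLM stack is actually wired up):

    import random

    a = random.Random(42)
    b = random.Random(42)

    # Same seed, same "random" sequence, every run.
    assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]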

lou1306 a day ago | parent [-]

There are subtle sources of nondeterminism in concurrent floating point operations, especially on GPU. So even with a fixed seed, if an LLM encounters two tokens with very close likelihoods, it may pick one or the other across different runs. This has been observed even with temperature=0, which in principle does not involve _any_ randomness (see arXiv paper cited earlier in this thread).
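
The underlying issue is that floating point addition is not associative, so the order in which a parallel reduction combines partial sums changes the result slightly. A contrived illustration of the same effect:

    a, b, c = 1e16, -1e16, 1.0

    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- the 1.0 is lost when added to -1e16 first

When two tokens' probabilities differ by less than that kind of rounding noise, the argmax can flip between runs even at temperature 0.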

mzl a day ago | parent | prev [-]

That depends on the sampling strategy. Greedy sampling takes the max token at each step.
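
Roughly, greedy sampling just replaces the weighted draw with an argmax (sketch):

    import numpy as np

    def greedy_next_token(logits):
        # No dice roll: always pick the highest-scoring token.
        return int(np.argmax(logits))

Which is why temperature 0 / greedy decoding is often treated as deterministic, modulo the floating point caveats mentioned above.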

pjmlp a day ago | parent | prev | next [-]

About as much as many devs who haven't read the respective ISO standards and the compiler manual back to back, and then get surprised by UB-based optimizations.

mzl a day ago | parent | prev [-]

Many compilers are not deterministic (this is why repeatable builds are not a solved problem), and many LLMs can be run in a mostly deterministic way.

miningape a day ago | parent [-]

Repeatable builds are not a requirement for determinism. Since the output can be determined from the exact system running the code, the process is deterministic - even though the output can vary from system to system.

This is to say every output can be understood by understanding the systems that produced it. There are no dice rolls required. I.e. if it builds wrongly every other Tuesday, the reason for that can be determined (there's a line of code describing this logic).

rthnbgrredf a day ago | parent [-]

While I don't disagree with your comment, I would say that a large language model and a Docker build from a complex Dockerfile, where not every version is exactly pinned, are quite similar. You might get updates from the base image, or from one of thousands of dependencies, and each day you rebuild the image you will get a different checksum - similar to how you get different answers from the LLM. And just like you can get wrong answers from the LLM, you can also get Docker builds that start to behave differently over time.

That is how it often is in practice. There is, of course, the option to pin down every version, and some large language models support temperature 0. That is more in the realm of determinism.
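
On the LLM side, most hosted APIs expose that as request parameters; a sketch assuming an OpenAI-style client (the model name is just for illustration, parameter support varies by provider, and as noted above this still doesn't guarantee bit-identical output):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice, swap in whatever you use
        messages=[{"role": "user", "content": "Add a column to the users table."}],
        temperature=0,   # greedy-ish decoding
        seed=1234,       # best-effort reproducibility, where supported
    )
    print(resp.choices[0].message.content)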