conorbergin 6 hours ago

LLMs are deterministic: the same model under the same conditions will produce the same output, unless some randomness is purposefully injected. Neural networks in general can be thought of as universal function approximators.

mrob 4 hours ago | parent | next [-]

Whenever somebody calls LLMs "non-deterministic", assume they meant "chaotic", in the informal sense of being a system where small changes of input can cause large changes to output, and the only way to find out if it will happen is by running the full calculation.

For many applications, this is just as troublesome as true non-determinism.

conorbergin 3 hours ago | parent [-]

I don't think LLMs are that chaotic: you can replace words in an input and get a similar answer, and they are very good at dealing with typos.

They are definitely not interpretable, though. I was reading some stuff from mechanistic interpretability researchers saying they've given up trying to build a bottom-up model of how they work.

mylifeandtimes 2 hours ago | parent [-]

> I don't think LLMs are that chaotic: you can replace words in an input and get a similar answer, and they are very good at dealing with typos.

Compare "You are a helpful assistant. Your task is to <100 lines of task description> <example problem>"

with

"you are a helpless assistant. Your task is to <100 lines of task description> <example problem>"

I've changed 3 or 4 CHARACTERS ("ful" to "less") out of a (by construction) 1000+ character prompt.

and the outputs are not at all similar.

Just realized I've never tried the "you are a helpless ass" prompt. Again a very minor change in wording, just dropping a few letters. The helpless assistant at least output text apologizing for being so bad at the task.

orbital-decay an hour ago | parent [-]

Sure. What did you expect? You changed the semantics of your prompt to the complete opposite. Of course it will attempt to make sense of it to the best of its ability and deliver what you requested. The input isn't formally specified; that's inherent to the domain, not the model or a human. GP, on the other hand, is talking about semantically negligible differences like typos.

2ndorderthought 5 hours ago | parent | prev | next [-]

That's not really true. If you turn a few knobs you can make them deterministic: set temperature to zero and turn off all history. But none of the cloud providers do this, because it's not a product as far as they're concerned. So in practice, not so much.

maplethorpe 5 hours ago | parent | next [-]

Can someone explain why this is? Do LLMs somehow contain a true random number generator? Why wouldn't they produce the same outputs given the same inputs (even temperature)?

edit: I'm not talking about an LLM as accessed through a provider. I'm just talking about using a model directly. Why wouldn't that be deterministic?

nowittyusername 41 minutes ago | parent | next [-]

There are A LOT of misconceptions about LLMs, and the biggest one is that they are not deterministic. They are 100% deterministic, and temperature has nothing to do with it. You WILL get exactly the same result every single time (at ANY temperature) as long as you use the same sampling parameters and server config parameters. What causes variance in LLMs is server behavior like batch processing and caching, among a few other things, with batching responsible for most of the issues. Batching is used because large providers serve multiple customers per GPU, and splitting up the VRAM is tricky and causes drift. If you start llama.cpp, for example, with only one person per slot and batching off, you will always get the same results every time, even at temperature 1.2 or whatever other parameters, because you are using one GPU per inference call, so no funny business there. The reason most people are unaware of this is that most people only have experience with the API instead of working with the actual inference engine itself, so this damned myth keeps spreading. My video for reference, where you can download it and try for yourself: https://www.youtube.com/watch?v=EyE5BrUut2o
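The seeding point is easy to demonstrate with a toy sampler (a hypothetical sketch, not llama.cpp's actual code): given the same seed and the same sampling parameters, even high-temperature sampling is fully repeatable.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Softmax with temperature, then draw a token index from the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5, -1.0]  # toy 4-token vocabulary

# Same seed + same temperature => identical token sequence every run,
# even at a "high" temperature like 1.2.
run1 = [sample_token(logits, 1.2, rng) for rng in [random.Random(42)] for _ in range(5)]
rng2 = random.Random(42)
run2 = [sample_token(logits, 1.2, rng2) for _ in range(5)]
assert run1 == run2
```

Variance only appears when something outside the sampler (batching, reduction order on the GPU) perturbs the logits between runs.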

anon373839 4 hours ago | parent | prev | next [-]

The model outputs a probability distribution for the next token, given the sequence of all previous tokens in the context window. It’s just a list of floats in the same order as the list of tokens that the tokenizer uses.

After that, a piece of software that is NOT the LLM chooses the next token. This is called the sampler. There are different sampling parameters and strategies available, but if you want repeatable* outputs, just take the token with the highest probability number.

* Perfect determinism in this sense is difficult to achieve because GPU calculations naturally have a minor bit of nondeterminism. But you can get very close.
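The greedy strategy described above is just an argmax over the model's output. A minimal sketch with made-up toy logits (real samplers work over vocabularies of ~100k tokens):

```python
def greedy_next_token(logits):
    """Pick the index of the highest-scoring token.
    Softmax is monotonic, so the argmax over raw logits equals
    the argmax over the softmaxed probabilities."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy logits over a 4-token vocabulary.
logits = [1.3, 4.2, 0.1, 2.5]
assert greedy_next_token(logits) == 1  # token 1 has the highest logit
```

Because no randomness is involved, this is the "repeatable" choice, up to the floating-point caveat in the footnote.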

2ndorderthought 4 hours ago | parent [-]

I'm not sold that an LLM is an LLM without a sampler, but it's not worth quibbling over. It's part of the statistical model anyway.

evrydayhustling 4 hours ago | parent | prev | next [-]

An LLM model itself -- that is, the weights and the mathematical functions linking them -- does not tell you exactly how to train from data, nor how to generate an output. Instead, it describes a function providing relative likelihood(output | input).

Deciding how to pick a particular output given that likelihood function is left as an exercise for the user, which we call inference.

One obvious choice is to keep picking the highest-likelihood token, feed it back into the model, and get another, on repeat. This is what most implementations call "temperature=0". But doing this token after token can lead to boring output, or steer you into pathological low-probability sequences like endless repeats.

So, the current SOTA is to intentionally introduce a random factor (temperature>0) to the sampling process -- along with other hacks, like explicit suppression of repeats.
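A minimal sketch of what the temperature knob does (toy logits assumed; the repeat-suppression hacks are omitted): dividing the logits by T before the softmax sharpens the distribution when T < 1 and flattens it when T > 1, so T → 0 degenerates to the greedy argmax described above.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T, then softmax.
    Low T concentrates mass on the argmax; high T spreads it out."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]

sharp = softmax_with_temperature(logits, 0.1)   # nearly one-hot
flat  = softmax_with_temperature(logits, 10.0)  # nearly uniform
assert sharp[0] > 0.99
assert max(flat) - min(flat) < 0.1
```

Sampling then draws from this distribution instead of always taking the top token, which is where the intentional randomness enters.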

2ndorderthought 4 hours ago | parent | prev [-]

Yeah, sure. Temperature is baked into these LLM stacks, and when it isn't zero it increases the probability of taking a different path when decoding the tokens, whether it's at a provider or downloaded on your own machine.

Technically, even when the temperature is 0 it's not guaranteed to be deterministic, just more likely to be. You can have ties in the probabilities for the next token. And floating-point noise is real.
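The floating-point point is easy to demonstrate: addition isn't associative in floating point, so a GPU that reduces a sum in a different order (which can happen under different batch sizes or kernel choices) can produce slightly different logits, occasionally flipping a near-tie.

```python
# Floating-point addition is not associative, so the order in which
# a sum is reduced changes the result at the last bit.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
assert left != right
```

A difference of one ulp in a logit is harmless until two tokens are nearly tied, at which point the argmax itself can flip.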

All these models are doing is guesstimating the next token to say.

slashdave 4 hours ago | parent | prev [-]

Eh, conceptually true, but in practice it is rather hard to get decent performance out of a GPU while still producing a deterministic answer.

And in any case, setting the temperature to zero will not produce a useful result, unless you don't mind your LLM constantly running into infinite loops.

alansaber 5 hours ago | parent | prev | next [-]

Yes, there's a good Thinking Machines Lab blog post about this.

0-_-0 5 hours ago | parent | prev [-]

You're being downvoted, but you're right. Determinism is a different concept and doesn't characterise LLMs well. You can have deterministic random number generators for example.