| ▲ | dvt 6 hours ago |
| An alarming number of people don't understand that LLMs work via purely stochastic processes, so I'm happy to see in-depth pieces like this. I'm looking for a job and maybe this is why it's so hard to get a callback these days: resumes are just dumped in some LLM black hole and no one really knows how it works. The author says: > temperature 0.1 — low, supposedly nudging the model toward deterministic outputs This is not correct (and is briefly touched on later in the piece when he sets temperature to 0), temperature is not some kind of "deterministic" switch, but rather it affects the sampling distribution (which becomes more "spiky"—but is still very much a distribution). |
|
| ▲ | miki123211 4 hours ago | parent | next [-] |
| In theory, temperature 0 does make the LLM deterministic. Well, in theory theory, temperature 0 doesn't really exist. Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most common sample (using the actual formula that works for non-zero values would cause a zero division). However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs. |
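A minimal sketch of the two branches described above, assuming logits arrive as a plain Python list. `temperature_probs` and `sample` are hypothetical names for illustration, not any inference engine's actual API.

```python
import math
import random

def temperature_probs(logits, temperature):
    """Return the sampling distribution over token indices (illustrative helper)."""
    if temperature == 0:
        # Separate branch: x / 0 is undefined, so put all mass on the
        # first maximal logit, i.e. greedy decoding.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

At a low but non-zero temperature such as 0.1 the result is still a distribution, only a very spiky one; only the `temperature == 0` branch removes the sampler's randomness entirely.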
| |
| ▲ | sigmoid10 4 hours ago | parent | next [-] | | >in theory theory, temperature 0 doesn't really exist. It does exist very much, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero, in both theory and practice, turns the usual probability function into greedy sampling. | | |
| ▲ | 317070 an hour ago | parent | next [-] | | > Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function. In pure math, it does not always do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options. | | |
| ▲ | sigmoid10 26 minutes ago | parent [-] | | That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal. That's why greedy sampling generally produces deterministic output for LLMs. The real gotchas are elsewhere (like with batch inference as we've seen with earlier GPTs). But unlike what the earlier comment says, this is a non-issue mathematically. | | |
| ▲ | skissane 11 minutes ago | parent [-] | | > That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal Any two tokens ending up with the exact same logit is very unlikely, but not impossible; and as the number of output tokens grows, the odds that it will happen eventually get higher and higher. I suppose, to ensure determinism, rank by logit then token ID, so you still have a deterministic winner even if occasionally two tokens get precisely identical logits. |
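The "rank by logit, then token ID" idea could look something like this; a sketch only, with `greedy_pick` as a made-up name and token IDs assumed to be list indices.

```python
def greedy_pick(logits):
    """Return the index of the largest logit; exact ties go to the lowest token ID."""
    # Sorting key: highest logit first (hence the negation), then smallest ID.
    return min(range(len(logits)), key=lambda token_id: (-logits[token_id], token_id))
```

With this rule the winner is a pure function of the logits, so a tie no longer needs a random draw. It does nothing about the logits themselves varying between runs.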
|
| |
| ▲ | thaumasiotes an hour ago | parent | prev [-] | | > It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check". | | |
| ▲ | sigmoid10 an hour ago | parent [-] | | The point is that the case T=0 doesn't just "exist" as a special code branch - it is still well defined mathematically without any change to the output function. What the above comment refers to with the extra "if" check is just a limitation of computers not liking to divide anything by zero, even if the actual function exists and is well behaved at zero. It is not some weird or special theoretical construction. |
|
| |
| ▲ | nullc an hour ago | parent | prev | next [-] | | If you make an exact integer implementation and run with temp=0 it's deterministic. You don't even need temperature 0, just make a random seed for the sampler part of the input and then it's deterministic as a function of the input. But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain so it's prone to feedback on its own noise. | |
| ▲ | chrisjj 2 hours ago | parent | prev | next [-] | | > However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run The implementation does not often differ run by run. | | |
| ▲ | skissane 7 minutes ago | parent [-] | | > The implementation does not often differ run by run. If you use a cluster, or even multiple clusters, and they have non-identical hardware, then two consecutive runs could end up being routed to nodes having different GPU models with slightly different floating point behaviour, or even software differences (e.g. newer GPU offers some feature usable to speed up calculations which older model lacked; same code can use the feature when it is available, fall back to slower alternative if it isn’t). The larger your scale, the greater the odds it will happen |
| |
| ▲ | lelandbatey 3 hours ago | parent | prev [-] | | As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you were to swap out that with something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware). But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent. They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models. | | |
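To illustrate the seed-driven part of this, here is a small sketch using Python's standard `random.Random`. The function name `sample_tokens` and the fixed probability list are assumptions for the example; a real engine recomputes the probabilities at every step.

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n token indices from a fixed distribution using a private, seeded PRNG."""
    rng = random.Random(seed)  # independent of the global random state
    return [rng.choices(range(len(probs)), weights=probs, k=1)[0] for _ in range(n)]

probs = [0.7, 0.2, 0.1]
run_a = sample_tokens(probs, 20, seed=42)
run_b = sample_tokens(probs, 20, seed=42)  # same seed, same probs: same draws
```

The guarantee only covers the sampler: it holds as long as `probs` is bit-for-bit identical between runs, which is exactly the part that is hard to ensure on real hardware.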
| ▲ | microtonal 3 hours ago | parent | next [-] | | Stable seeding is not enough. A lot of modern, fast compute kernels are nondeterministic. Floating point multiplication/addition is not strictly associative and e.g. reductions can combine results from different threads in different orders (e.g. through atomic ops). You can write kernels to be deterministic, but it is generally less efficient. | |
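The non-associativity is easy to see even on a CPU. A minimal sketch, where `sum_in_order` is a made-up helper that stands in for a reduction whose combination order is not fixed:

```python
def sum_in_order(values, order):
    """Add the values one by one in the given index order, as a reduction might."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

values = [0.1, 0.2, 0.3]
forward = sum_in_order(values, (0, 1, 2))  # (0.1 + 0.2) + 0.3 -> 0.6000000000000001
rotated = sum_in_order(values, (1, 2, 0))  # (0.2 + 0.3) + 0.1 -> 0.6
```

Each individual addition is fully deterministic; the two results differ only because the rounding happens at different intermediate values. A kernel that lets threads finish in any order is effectively choosing `order` at run time.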
| ▲ | toolslive 3 hours ago | parent | prev | next [-] | | It's probably due to the fact that it's a cloud service. You have no guarantee that your next request will go to the same machine. So even with an identical seed, and temp 0 you might get different hardware and hence different accuracy/noise in the floating point operations. | | |
| ▲ | rightbyte 2 hours ago | parent [-] | | How can there be noise in floating point operations? I could buy like completion order for parallelized batches i.e. adding a+b+c is different from a+c+b etc. | | |
| |
| ▲ | nok22kon 3 hours ago | parent | prev [-] | | that's incorrect in the presence of batching. it's tough work making it truly deterministic: https://x.com/FireworksAI_HQ/status/2069873437217276015 | | |
| ▲ | vidarh 2 hours ago | parent [-] | | It's not that hard. What is hard is making it truly deterministic and retain high throughput. |
|
|
|
|
| ▲ | vessenes 22 minutes ago | parent | prev | next [-] |
| To be clear, temperature 0 is deterministic and will produce the same output for exact duplicate inputs, across all seed choices. Provided: * If it’s MoE we are talking about, that the duplicate inputs are for the whole batch (yes, your batch neighbours can impact your choice of experts. Blergh.) * Your kernels are deterministic * There’s no system wide effort switch that responds to, e.g. work load across the cluster (for a thinking model) Upshot: Temperature 0 is not deterministic in probably any existing cloud infra, but it could be for edge inference pretty reliably. To your quibble on 0.1 being more deterministic - I think it’s a pretty fair summary - we’re going to sample much more from the ‘temp 0’ answer at 0.1 than we would at temp 0.9, no? |
| |
| ▲ | Dylan16807 16 minutes ago | parent [-] | | Even then it's deterministic in the way a hash function is deterministic. Change one letter and you can get a completely different output. What people actually want is something continuous. | | |
|
|
| ▲ | aesthesia 5 hours ago | parent | prev | next [-] |
| A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did. |
| |
| ▲ | 317070 5 hours ago | parent | next [-] | | > so in principle, setting temperature to 0 _should_ result in deterministic outputs It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly random from them with any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element. Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network. | | |
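The limit behind this claim can be written out. This is a sketch in notation introduced here rather than taken from the thread: $z$ are the logits, $m = \max_j z_j$, and $\mathcal{M}$ is the set of indices where the maximum is attained.

```latex
\[
p_i(T) \;=\; \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}
       \;=\; \frac{e^{(z_i - m)/T}}{\sum_j e^{(z_j - m)/T}}
\;\xrightarrow[\;T \to 0^+\;]{}\;
\begin{cases}
  \dfrac{1}{|\mathcal{M}|} & \text{if } i \in \mathcal{M},\\[6pt]
  0 & \text{otherwise.}
\end{cases}
\]
```

Every term with $z_j < m$ has a negative exponent that goes to $-\infty$, so it vanishes, while every term with $z_j = m$ stays exactly 1. For $z = (1, 0, 1)$ this gives $(1/2,\, 0,\, 1/2)$; the limit is a single spike only when the maximum is unique.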
| ▲ | EvgeniyZh 5 hours ago | parent | next [-] | | You don't have to sample uniformly. You could take the lowest index of all maxima.
But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing with it | |
| ▲ | jstanley 3 hours ago | parent | prev | next [-] | | > "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM. | | |
| ▲ | vbarrielle 3 hours ago | parent [-] | | It may be an implementation detail, but in practice, if the only way to get a deterministic output is to run on the CPU, then it's not going to be usable. | | |
| |
| ▲ | DougBTX 3 hours ago | parent | prev [-] | | > GPUs put the associativity of the sums in matrix multiplications in arbitrary order That’s user-controlled too, not an inherent property of GPUs: https://docs.pytorch.org/docs/2.12/generated/torch.use_deter... | | |
| ▲ | vbarrielle 2 hours ago | parent [-] | | The matrix multiplication is only deterministic for sparse-dense products under these settings: > torch.bmm() when called on sparse-dense CUDA tensors And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic. |
|
| |
| ▲ | easygenes 5 hours ago | parent | prev | next [-] | | There are. If the kernels are nondeterministic (e.g. timing issues) there are minor changes between runs, on a single system, even with eager decode enabled (typically what temperature=0 achieves). | |
| ▲ | IshKebab 5 hours ago | parent | prev | next [-] | | Setting the temperature to 0 should give deterministic results but that's not any better - it's just hiding the huge variance by only taking one sample. | |
| ▲ | croes 4 hours ago | parent | prev | next [-] | | So you would get always the same result, but it could be the wrong one | | | |
| ▲ | valzam 5 hours ago | parent | prev [-] | | I mean the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. but even with temp 0 you could have 0.8 T1, 0.19 T2, ... and sometimes sample T2 | | |
| ▲ | aesthesia 5 hours ago | parent [-] | | No, this can't happen at temperature 0. The formula defining temperature-adjusted softmax isn't strictly defined at 0, but taking the limit (in the case where all logits are distinct) results in probability 1 being placed on the largest logit. Samplers will typically special case temperature 0 and pick the most likely token at each step. | | |
| ▲ | dvt 5 hours ago | parent [-] | | This is a very authoritative answer that should be more nuanced and caveated as implementation-dependent. In some cases, repetition penalties take precedence over sampling; top_k and top_p can also be handled before or after the temperature step. In other cases, `0` is turned into like 1e-10 or some super tiny float value (which can drift if you do any arithmetic with it). Routing, quantization, etc. can also have an effect on sampling. And yes, in some cases, setting temperature to 0 can mean "pure greedy decoding" which makes the decoder about as deterministic as it can get. |
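As one concrete case of the ordering point, here is a sketch showing that a top_p cutoff keeps a different set of tokens depending on whether it is applied before or after temperature scaling. The helper names and the toy logits are assumptions for illustration; real samplers differ in the details.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (temperature must be non-zero here)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Smallest set of highest-probability token IDs whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.0]
# top_p applied to the unscaled (T=1) distribution, temperature 2.0 applied later:
filter_then_heat = nucleus(softmax(logits, 1.0), top_p=0.9)
# temperature 2.0 applied first, top_p applied to the flattened distribution:
heat_then_filter = nucleus(softmax(logits, 2.0), top_p=0.9)
```

Here the first ordering keeps two candidate tokens and the second keeps all three, so the same nominal settings can yield different candidate sets in different implementations.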
|
|
|
|
| ▲ | make3 5 hours ago | parent | prev | next [-] |
| A more spiky distribution exactly makes the distribution closer to deterministic. That's not the point though. Even in greedy (deterministic) decoding, it is still a black box though that reacts in ways that are unpredictable to the inputs. Switching one word around might lead to different scores for example. |
| |
| ▲ | fluoridation 44 minutes ago | parent [-] | | Yeah, this is the forest that the people arguing about math trees are missing. It doesn't matter that the algorithm is deterministic if the algorithm passes the input through a cryptographic hash function to make a yes/no decision. The result may be perfectly reproducible and still nonsensical in its distribution with respect to its input domain. |
|
|
| ▲ | bhanu786 3 hours ago | parent | prev | next [-] |
| Agree |
|
| ▲ | spwa4 4 hours ago | parent | prev | next [-] |
| > An alarming number of people don't understand that LLMs work via purely stochastic processes ... I've been studying AI for 20 years. What really needs to be added to this statement is: "An alarming number of people don't understand that LLMs work via purely stochastic processes - and so does human thinking. People do NOT arrive at the same conclusion if merely the weather's different. Worse: with human thinking not only do most people not think this is real, a subset of people will actively fight the idea. Of course, depending on the weather" |
| |
| ▲ | thisisit 31 minutes ago | parent | next [-] | | The same person is not going to give you three different answers within span of minutes. Especially when nothing fundamentally has changed. People might or might not update their views depending on their biases. | |
| ▲ | miki123211 4 hours ago | parent | prev | next [-] | | What's even worse, different humans have different weights. If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound. Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical tween goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures". | | |
| ▲ | chrisjj 2 hours ago | parent [-] | | > What's even worse, different humans have different weights. Far worse would be different humans having the same weights. |
| |
| ▲ | smusamashah 4 hours ago | parent | prev | next [-] | | We expect computers to be consistent on the other hand. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs are on computers and should be fairly consistent too. | | |
| ▲ | vidarh 2 hours ago | parent [-] | | And this lies at the heart of the problem. We expect computers to be consistent despite running programs that are not designed to be consistent. This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs. But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs. | | |
| ▲ | chrisjj 2 hours ago | parent [-] | | > This despite the fact that we have lots of experience of programs running on computers that produces wildly inconsistent outputs. The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty. | | |
| ▲ | newswasboring 32 minutes ago | parent [-] | | Yeah but daily tools have lots of complexity which appears as non determinism (if we are thinking only UX, not actual determinism). For example, try moving an image in the word doc. I have been using MS word my entire life it seems, still don't know what the rules are lol. |
|
|
| |
| ▲ | mnky9800n 4 hours ago | parent | prev | next [-] | | Test retest reliability is a thing in psychometrics. | | |
| ▲ | spwa4 2 hours ago | parent [-] | | Ah cool. So there is data? How consistent are humans? What I'd really love is an actual number for a "human hallucination rate". How often will a random human 1) claim something that is wrong 2) defend the wrong claim and/or logic even when the problem is pointed out to them (and this of course outside of the usual topics. In politics? I don't care. In religion? Don't care (well, maybe a bit more than politics). Let's say in physics or popular logic or something like that) |
| |
| ▲ | cyanydeez an hour ago | parent | prev [-] | | a studied example is sampling judicial decisions before lunch and after lunch. judges are more lenient on a full stomach. | | |
| ▲ | WhrRTheBaboons 44 minutes ago | parent [-] | | how did they account for sampling bias? a judge might leave easier cases for after lunch. people with control over their schedules usually ease themselves back into it after breaks. | | |
| ▲ | chuckadams 24 minutes ago | parent [-] | | The studies observed the results of decisions from the exact same charges. Judges don't get to pick their dockets. |
|
|
|
|
| ▲ | nok22kon 3 hours ago | parent | prev | next [-] |
| it's a bad idea in general to use non-1.0 temperature. there is a reason labs are strongly recommending using 1.0. using low temperature is more deterministic, but the cost is the model becomes "dumber" |
| |
| ▲ | tipsytoad 2 hours ago | parent | next [-] | | 1.0 is actually pretty arbitrary and way too high as a general rule. Something like 0.3 is a more sensible default | | |
| ▲ | 317070 an hour ago | parent | next [-] | | If RL was used to train the model, the model will have been trained on its own sequences. Those will have been generated with a temperature of 1.0. They must be, otherwise you would get a premature collapse or explosion of your entropy if the temperature was respectively lower or higher. After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution. That is why the sampling step for agents or thinking LLMs are usually kept at a temperature of 1.0. | |
| ▲ | zipy124 2 hours ago | parent | prev | next [-] | | It really depends on the application does it not? I'm not an LLM guy, but for creative tasks like storytelling wouldn't you want a higher temperature usually? Happy to gain insight from anyone with experience here :) | |
| ▲ | embedding-shape 2 hours ago | parent | prev | next [-] | | Heavily depends on the model architecture and the implementation though, I don't think you can say what values are better than others without first specifying those, otherwise it's straight up guessing, ironically. | |
| ▲ | nullc an hour ago | parent | prev [-] | | If you use a model in a configuration far from where it was RLed you get no warranty. (you also get no warranty the other way, however) |
| |
| ▲ | codeflo 2 hours ago | parent | prev | next [-] | | It can be useful for pure translation tasks and stuff like that where you explicitly don't want creativity of any kind. | |
| ▲ | vidarh 2 hours ago | parent | prev [-] | | Plenty of setups default to lower values than 1.0. |
|
|
| ▲ | bluechair 6 hours ago | parent | prev [-] |
| Willing to be corrected but I believe this type of automated resume filtering is illegal. Not saying it never happens but my understanding is it is not typical. |
| |
| ▲ | thayne 6 hours ago | parent | next [-] | | I would expect that to depend on jurisdiction. I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision. | |
| ▲ | small_scombrus 6 hours ago | parent | prev | next [-] | | They don't need to actually filter/blackhole to have the same virtual effect. Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking *scores are generated with AI, mistakes may be made, use only as a guide and verify results | |
| ▲ | ivan_gammel 5 hours ago | parent | prev | next [-] | | In situations when you get hundreds of applications for one open position (real market now), whatever reduces your pool to the size a human can handle, works. You can preserve some diversity metrics in the process. This particular filtering is rather primitive, but LLM as a first filter can definitely do the job. You may burn less tokens than the hourly rate of your HR and it will be fairer than just dumping 50% of unread CVs in trash. | | |
| ▲ | 369548684892826 4 hours ago | parent [-] | | Great until someone realises you’ve filtered out minority groups from the application process (most developers are men so maybe the LLM decided they’re the best fit, but you’ll never know exactly why it screwed you over) and you suddenly have an expensive lawsuit | | |
| ▲ | TeMPOraL an hour ago | parent | next [-] | | LLMs are DEI-aware, as over the past few years, their vendors all had various high profile news stories with their models and their default biases, so it's more likely they'll heavily discriminate in favor of minority candidates, not against them. Still, in both cases it would indicate whoever is operating the system is doing a really, really lazy job. It's really not hard to test and supervise LLMs on tasks where they give you a mere 2-10x leverage, and prompt adherence today is much better than it was 3 years ago. | |
| ▲ | cyanydeez an hour ago | parent | prev [-] | | this happened a decade ago when a US court tried to make sentencing decisions via ML. it was easily demonstrated that the training data was flawed because the justice system was flawed, so the data it was trained on was weighted against minorities because it oversampled because you know, police routinely oversample and poverty forces oversampling. nonetheless, people will defend history as perfect and say those samples, like nepo babies, are "perfect". |
|
| |
| ▲ | elric 4 hours ago | parent | prev | next [-] | | Under GDPR, you have the right to request manual processing whenever personal data is processed automatically to make a decision about you that has "significant impact". Not being hired seems like it would qualify. | |
| ▲ | dgellow 5 hours ago | parent | prev [-] | | Illegal where? |
|