jldugger 2 days ago
Well, there are kind of two answers here:

1. To the extent that creativity is randomness, LLM inference samples from the token distribution at each step. It's possible (but unlikely!) for an LLM to complete "pig with" with the token sequence "a dragon head" just by random chance. The temperature setting commonly exposed controls how often the system takes the most likely candidate tokens.

2. A Markov chain model literally has an entry for every possible combination of inputs, so a second-order chain has an entry for each of the n^2 possible two-token contexts, where n is the number of possible tokens. In that situation "pig with" can never be completed with a brand-new sentence, because unseen continuations have literal 0s in the probability table. In contrast, transformers consider huge context windows and start with random weights in huge neural-network matrices. What people hope happens is that the network begins to represent ideas, and the connections between them. This gives them a shot at passing "out of distribution" tests, which is a cornerstone of modern AI evaluation.
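A rough sketch of the contrast, using a toy corpus and made-up logits (nothing here is any real model's numbers):

    import math, random
    from collections import Counter

    corpus = ["the pig with a curly tail", "the pig with muddy feet"]
    vocab = sorted({w for line in corpus for w in line.split()} | {"dragon"})

    # Second-order Markov chain: counts indexed by the previous two tokens.
    counts = Counter()
    for line in corpus:
        toks = line.split()
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            counts[(a, b, c)] += 1

    def markov_prob(a, b, c):
        total = sum(counts[(a, b, w)] for w in vocab)
        return counts[(a, b, c)] / total if total else 0.0

    print(markov_prob("pig", "with", "dragon"))   # 0.0: a hard zero, can never be sampled

    # Neural-style sampling: every token gets a logit; temperature rescales the
    # logits before the softmax, so unlikely tokens stay possible.
    logits = {"a": 2.0, "muddy": 1.5, "dragon": -3.0}   # made-up numbers

    def sample(logits, temperature=1.0):
        scaled = {w: l / temperature for w, l in logits.items()}
        z = sum(math.exp(v) for v in scaled.values())
        r, acc = random.random(), 0.0
        for w, v in scaled.items():
            acc += math.exp(v) / z
            if r < acc:
                return w
        return w

    print(sample(logits))   # "dragon" comes out rarely, but it can come out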
thesz a day ago
The less frequent prefixes are usually pruned away, and there is a penalty score added when backing off to the shorter prefix. In the end, all words are included in the model's prediction, and a typical n-gram SRILM model is able to generate "the pig with dragon head," also with small probability. Even if you think of the Markov chain information as a tensor (not a matrix), computing the probabilities is not a single lookup but a series of folds.
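A simplified sketch of that backoff idea (a "stupid backoff" style fallback with a fixed penalty, not SRILM's exact Katz/Kneser-Ney arithmetic): when the long prefix is missing or pruned, fall back to a shorter prefix and pay a penalty, so every word keeps a small nonzero probability.

    import math
    from collections import Counter

    corpus = "the pig with a curly tail . the pig with muddy feet .".split()

    unigrams = Counter(corpus)
    bigrams  = Counter(zip(corpus, corpus[1:]))
    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    N = len(corpus)
    PENALTY = math.log(0.4)          # the score paid each time we back off

    def log_prob(word, prev2, prev1):
        if trigrams[(prev2, prev1, word)]:
            return math.log(trigrams[(prev2, prev1, word)] / bigrams[(prev2, prev1)])
        if bigrams[(prev1, word)]:
            return PENALTY + math.log(bigrams[(prev1, word)] / unigrams[prev1])
        # final backoff to the unigram level, with add-one smoothing for unseen words
        return 2 * PENALTY + math.log((unigrams[word] + 1) / (N + len(unigrams) + 1))

    print(log_prob("muddy",  "pig", "with"))   # seen trigram: cheap
    print(log_prob("dragon", "pig", "with"))   # unseen everywhere: small but nonzero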
vrighter a day ago
A Markov chain model does not specify the implementation details of the function that takes a previous input (and only a previous input) and outputs a probability distribution. You could feed every possible input into an LLM (there are finitely many) and record the resulting output in a table. "Temperature" is applied to the final output, not inside the function.
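A toy sketch of that equivalence, with a stand-in for the model and a tiny vocabulary (everything here is illustrative):

    import itertools, math

    vocab = ["pig", "with", "dragon"]
    CONTEXT_LEN = 2

    def model(context):
        # stand-in for an LLM forward pass: any deterministic context -> logits map works
        return {w: float(len(w) + sum(len(c) for c in context)) for w in vocab}

    # Enumerate every possible context once and record the output: the lookup-table
    # ("Markov chain") view of the same model. len(vocab) ** CONTEXT_LEN rows here;
    # astronomically many for a real LLM, but still finite.
    table = {ctx: model(ctx) for ctx in itertools.product(vocab, repeat=CONTEXT_LEN)}

    def softmax_with_temperature(logits, temperature):
        scaled = {w: l / temperature for w, l in logits.items()}
        z = sum(math.exp(v) for v in scaled.values())
        return {w: math.exp(v) / z for w, v in scaled.items()}

    # Temperature is a post-processing step applied to whatever the table returns.
    print(softmax_with_temperature(table[("pig", "with")], temperature=0.7))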
theGnuMe a day ago
You can have small epsilons instead of zeros.
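That is additive (add-epsilon) smoothing; a minimal sketch with made-up counts:

    from collections import Counter

    counts = Counter({"a": 5, "muddy": 3})   # observed continuations of "pig with" (made up)
    vocab = ["a", "muddy", "dragon"]
    EPS = 1e-3

    def smoothed_prob(word):
        total = sum(counts[w] + EPS for w in vocab)
        return (counts[word] + EPS) / total

    print(smoothed_prob("dragon"))   # tiny, but no longer exactly zero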
otabdeveloper4 a day ago
Re point 1: no, "temperature" is not an inherent property of LLMs. The big cloud providers use the "temperature" setting because having the assistant repeat the exact same output sequence to you exposes the man behind the curtain and breaks suspension of disbelief. But if you run the LLM yourself and want the best-quality output, turning "temperature" off entirely makes sense. That's what I do. (The downside is that the LLM can then, rarely, get stuck in infinite loops. Again, this isn't a big deal unless you really want to persist with the delusion that an LLM is a human-like assistant.)
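A sketch of the two decoding modes being contrasted, with made-up logits: "temperature off" here means greedy argmax decoding, which is deterministic but can lock into a repeating cycle, while sampling varies run to run.

    import math, random

    def next_logits(context):
        # stand-in for a model call; real logits would come from the LLM itself
        return {"loop": 2.0, "stop": 1.9, "other": 0.5}

    def decode(steps, temperature=None):
        out, ctx = [], []
        for _ in range(steps):
            logits = next_logits(ctx)
            if not temperature:                          # "temperature off": greedy argmax
                tok = max(logits, key=logits.get)        # deterministic, can cycle forever
            else:
                z = sum(math.exp(v / temperature) for v in logits.values())
                r, acc = random.random(), 0.0
                for w, v in logits.items():
                    acc += math.exp(v / temperature) / z
                    if r < acc:
                        break
                tok = w
            out.append(tok)
            ctx.append(tok)
        return out

    print(decode(5))                    # greedy: ['loop', 'loop', 'loop', 'loop', 'loop']
    print(decode(5, temperature=1.0))   # sampled: varies from run to run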