Sohcahtoa82 2 days ago

> Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

> Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

This is an oversimplification of an LLM.

The output layer of an LLM contains a logit for every token in the vocabulary. That covers every word, word fragment, punctuation mark, and whatever emojis or symbols it knows. Because the logits are calculated through a whole lot of floating point math, it's very likely that most of them will be non-zero. Very close to zero, but still non-zero.
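As a toy sketch (made-up logits for a 5-token vocabulary, not from any real model), the softmax step that turns logits into probabilities never produces an exact zero:

    import numpy as np

    # Hypothetical output-layer logits for a 5-token vocabulary (invented numbers).
    logits = np.array([8.1, 2.3, -4.0, -9.5, 0.7])

    # Numerically stable softmax: subtract the max, exponentiate, normalize.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    print(probs)        # every entry is > 0, most of them vanishingly small
    print(probs.sum())  # ~1.0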

This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they don't get chosen in reality is top-k sampling, temperature, and the other filtering that's done on the logits before actually choosing a token.
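Here's a rough sketch of what that filtering can look like (top-k plus temperature only, with invented numbers; real samplers also apply things like top-p/nucleus sampling and repetition penalties):

    import numpy as np

    def sample_next(logits, k=3, temperature=0.8, rng=None):
        """Toy top-k + temperature sampling over a vector of logits."""
        rng = rng or np.random.default_rng()
        logits = np.asarray(logits, dtype=float) / temperature  # temperature rescales the logits
        top = np.argsort(logits)[-k:]                           # keep only the k highest-scoring tokens
        probs = np.exp(logits[top] - logits[top].max())
        probs /= probs.sum()
        return rng.choice(top, p=probs)  # tokens outside the top k can never be picked

    # With k=3, the two lowest-scoring "gibberish" tokens are filtered out entirely.
    print(sample_next([8.1, 2.3, -4.0, -9.5, 0.7]))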

If you present s_1, s_2, ..., s_N to a Markov Chain when that sequence was never seen by the chain, there is no probability distribution to sample from at all. But if you present it to an LLM, it gets fed into a neural network and a full set of logits comes out the other end, and you can still choose the next token based on it.

thaumasiotes a day ago | parent

> This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they don't get chosen in reality is top-k sampling, temperature, and the other filtering that's done on the logits before actually choosing a token.

> If you present s_1, s_2, ..., s_N to a Markov Chain when that sequence was never seen by the chain

No, you're confused.

The chain has never seen anything. The Markov chain is a table of probability distributions. You can create it by any means you see fit. There is no such thing as a "series" of tokens that has been seen by the chain.

Sohcahtoa82 20 hours ago | parent

> The chain has never seen anything. The Markov chain is a table of probability distributions. You can create it by any means you see fit. There is no such thing as a "series" of tokens that has been seen by the chain.

When I talk about the chain "seeing" a sequence, I mean that the sequence existed in the material that was used to generate the probability table.

My instinct is to believe that you know this, but are being needlessly pedantic.

My point is that if you're using a context length of two and you prompt a Markov Chain with "my cat", but the sequence "my cat was" never appeared in the training material, then the Markov Chain will never choose "was" as the next word. This property is not true for LLMs. If you prompt an LLM with "my cat", then "was" has a non-zero chance of being chosen as the next word, even if "my cat was" never appeared in the training material.
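A minimal sketch of the difference, modeling the Markov Chain's probability table as a Python dict (the counts and follow-up words here are invented; the LLM side is just a comment standing in for a forward pass):

    # Order-2 Markov Chain as a lookup table built only from observed trigrams.
    markov_table = {
        ("my", "cat"): {"is": 0.5, "ate": 0.3, "meowed": 0.2},  # "was" never observed after "my cat"
    }

    print(markov_table[("my", "cat")].get("was", 0.0))  # exactly 0.0 -- "was" can never be sampled

    # An LLM, by contrast, runs "my cat" through its network and emits a logit for
    # every vocabulary token, so the softmax probability of "was" is tiny but > 0.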