purple_turtle 2 days ago

1) Being restricted to exact matches in the input is the definition of Markov Chains.

2) LLMs are not Markov Chains.

saithound 2 days ago | parent | next [-]

A Markov chain [1] is a discrete-time stochastic process in which the value of each variable depends only on the value of the immediately preceding variable, and not on any variables further back.

LLMs are most definitely (discrete-time) Markov chains in this sense: the variables take their values in the space of context windows, and the distribution of the new context window depends only on the previous one, not on anything sampled earlier.

A Markov chain is a Markov chain, no matter how you implement it in a computer, whether as a lookup table, or an ordinary C function, or a one-layer neural net or a transformer.

LLMs and Markov text generators are technically both Markov chains, so some of the same math applies to both. But that's where the similarities end: e.g. the state space of an LLM is the set of possible context windows, whereas the state space of a Markov text generator is usually the set of N-tuples of words.
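To make the shared structure concrete, here is a rough Python sketch of one step of each process (table and model here are stand-ins, not real APIs):

    import random

    # Both have the same shape: sample the next state from a distribution
    # that depends only on the current state. Only the state space differs.

    def markov_step(state, table):
        # state is an N-tuple of words; table maps it to (words, probabilities)
        words, probs = table[state]
        nxt = random.choices(words, weights=probs)[0]
        return state[1:] + (nxt,)

    def llm_step(context, model, window=2048):
        # context is the current window of token ids; model returns a
        # distribution over the whole vocabulary
        probs = model(context)
        nxt = random.choices(range(len(probs)), weights=probs)[0]
        return (context + [nxt])[-window:]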

And since the question here is how tiny LLMs differ from Markov text generators, the differences certainly matter here.

[1] https://en.wikipedia.org/wiki/Discrete-time_Markov_chain

thaumasiotes 2 days ago | parent | prev [-]

> 1) Being restricted to exact matches in the input is the definition of Markov Chains.

Here's wikipedia:

> a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

A Markov chain is a finite state machine in which transitions between states may have probabilities other than 0 or 1. In this model, there is no input; the transitions occur according to their probability as time passes.
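For example, a toy chain in Python (states and probabilities made up):

    import random

    # A two-state chain: no input, just probabilistic transitions over time.
    transitions = {
        "sunny": (["sunny", "rainy"], [0.9, 0.1]),
        "rainy": (["sunny", "rainy"], [0.5, 0.5]),
    }

    state = "sunny"
    for _ in range(10):
        nexts, probs = transitions[state]
        state = random.choices(nexts, weights=probs)[0]
        print(state)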

> 2) LLMs are not Markov Chains

As far as the concept of "Markov chains" has been used in the development of linguistics, they are seen as a tool for text generation. A Markov chain for this purpose is a hash table. The key is a sequence of tokens (in the state-based definition, this sequence is the current state), and the value is a probability distribution over a set of tokens.

To rephrase this slightly, a Markov chain is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then for the following token you should choose t_1 with probability p_1, t_2 with probability p_2, etc...".

Then, to tie this back into the state-based definition, we say that when we choose token t_k, we emit that token into the output, and we also dequeue the first token from our representation of the state and enqueue t_k at the back. This brings us into a new state where we can generate another token.
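If it helps, here is a throwaway sketch of exactly that (order 2, naive whitespace tokenization; every choice here is arbitrary):

    import random
    from collections import Counter, defaultdict

    def build_table(text, n=2):
        # Hash table: (s_1, ..., s_N) -> counts of the token that followed.
        table = defaultdict(Counter)
        tokens = text.split()
        for i in range(len(tokens) - n):
            table[tuple(tokens[i:i + n])][tokens[i + n]] += 1
        return table

    def generate(table, state, length=20):
        state = list(state)
        out = list(state)
        for _ in range(length):
            dist = table.get(tuple(state))
            if not dist:                 # no entry for this key
                break
            tokens, counts = zip(*dist.items())
            t = random.choices(tokens, weights=counts)[0]
            out.append(t)
            state = state[1:] + [t]      # dequeue the front, enqueue t at the back
        return " ".join(out)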

A large language model is seen slightly differently. It is a function. The independent variable is a sequence of tokens, and the dependent variable is a probability distribution over a set of tokens. Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

You might notice that these two tables contain the same information organized in the same way. The transformation from an LLM to a Markov chain is the identity transformation. The only difference is in what you say you're going to do with it.
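Or, to put the point in code (schematic; table and net are placeholders):

    # Two implementations of the same mathematical object: a map from a
    # token sequence to a probability distribution over the next token.

    def next_dist_markov(last_n_tokens, table):
        return table[tuple(last_n_tokens)]   # dictionary lookup

    def next_dist_llm(last_n_tokens, net):
        return net(last_n_tokens)            # evaluate a neural network

    # A caller that samples text from either one can't tell them apart.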

Sohcahtoa82 2 days ago | parent | next [-]

> Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

> Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

This is an oversimplification of an LLM.

The output layer of an LLM contains logits for every token in the vocabulary: every word, word fragment, punctuation mark, and whatever emoji or symbol it knows. Because the logits are calculated through a whole lot of floating-point math, it's very likely that nearly all of them will be non-zero. Very close to zero, but still non-zero.

This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they aren't chosen in practice is the top-k sampling, temperature, and other filtering that's done on the logits before actually choosing a token.
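Roughly what that filtering looks like, as a generic sketch (not any particular library's actual sampling code):

    import math, random

    def sample(logits, temperature=0.8, top_k=40):
        # Temperature rescales the logits; top-k keeps only the k largest.
        scaled = [(l / temperature, i) for i, l in enumerate(logits)]
        kept = sorted(scaled, reverse=True)[:top_k]
        # Softmax over the survivors; everything outside the top k gets
        # probability exactly zero, even though its raw logit was non-zero.
        m = max(l for l, _ in kept)
        weights = [math.exp(l - m) for l, _ in kept]
        ids = [i for _, i in kept]
        return random.choices(ids, weights=weights)[0]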

If you present s_1, s_2, ..., s_N to a Markov Chain when that series was never seen by the chain, the lookup comes back empty: there is no entry for that key. But if you present them to an LLM, they get fed into a neural network and eventually a set of logits comes out, and you can still choose the next token based on it.

thaumasiotes a day ago | parent [-]

> This means that gibberish options for the next token have non-zero probabilities of being chosen. The only reason they aren't chosen in practice is the top-k sampling, temperature, and other filtering that's done on the logits before actually choosing a token.

> If you present s_1, s_2, ..., s_N to a Markov Chain when that series was never seen by the chain

No, you're confused.

The chain has never seen anything. The Markov chain is a table of probability distributions. You can create it by any means you see fit. There is no such thing as a "series" of tokens that has been seen by the chain.

Sohcahtoa82 20 hours ago | parent [-]

> The chain has never seen anything. The Markov chain is a table of probability distributions. You can create it by any means you see fit. There is no such thing as a "series" of tokens that has been seen by the chain.

When I talk about the chain "seeing" a sequence, I mean that the sequence existed in the material that was used to generate the probability table.

My instinct is to believe that you know this, but are being needlessly pedantic.

My point is that if you're using a context length of two and you prompt a Markov Chain with "my cat", but the sequence "my cat was" never appeared in the training material, then the Markov Chain will never choose "was" as the next word. This property is not true for LLMs. If you prompt an LLM with "my cat", then "was" has a non-zero chance of being chosen as the next word, even if "my cat was" never appeared in the training material.
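In code, with a hand-made toy table (words and counts invented):

    from collections import Counter

    # Order-2 table built from material where "my cat was" never occurs:
    table = {("my", "cat"): Counter({"sat": 1, "ran": 1})}

    print(table.get(("my", "cat")))    # "was" has probability exactly 0
    print(table.get(("your", "cat")))  # None: unseen key, no distribution at all
    # An LLM handed either prefix still produces a full distribution over its
    # vocabulary, so "was" always keeps some tiny non-zero probability.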

purple_turtle 2 days ago | parent | prev [-]

Maybe technically an LLM can be converted to an equivalent Markov Chain.

The problem is that even for modest context sizes, the equivalent Markov Chain would be hilariously, monstrously large.
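Back of the envelope, with round made-up numbers:

    import math

    vocab = 50_000   # tokens in the vocabulary
    context = 1_000  # context length in tokens, small by modern standards

    # One table row per possible context window:
    print(f"~10^{context * math.log10(vocab):.0f} rows")   # ~10^4699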

You may as well say that an LLM and a hash table are the same thing.

thaumasiotes a day ago | parent [-]

As I just mentioned in the comment you're responding to, the way you convert an LLM into an equivalent Markov chain is by doing nothing, since it already is one.

> You may as well say that an LLM and a hash table are the same thing.

No. You may as well say that a hash table and a function are the same thing. And this is in fact a common thing to say, because they are the same thing.

An LLM is a significantly more restricted object than a function is.

purple_turtle a day ago | parent [-]

> LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

No, an LLM is not a lookup table over all possible inputs.