ComplexSystems 2 days ago
A few things: First, modern LLMs can be thought of, abstractly, as a kind of Markov model. We are taking the entire previous output as one state vector, and from there we have a distribution over the next state vector, which is the updated output with another token added. The point is that there is some subtlety in what a "state" is. So that's one thing.

But the point of the usual Markov chain is that we need to figure out the next conditional probability based on the entire previous history. Making a lookup table based on an exponentially increasing history of possible combinations of tokens is impossible, so we make a lookup table on the last N tokens instead - this is an N-gram language model, or an Nth-order Markov chain, where states are now individual tokens. It is much easier, but it doesn't give great results.

The main reason is that sometimes the last N words (or tokens, whatever) simply do not carry enough information about what the next word should be. Often some fragment of context way back at the beginning is much more relevant. You can increase N, but then a lot of the window is taken up by intervening grammatical filler words that are useless, and the table also grows exponentially with N. So the 5 most important words to look at, given the current word, could be 5 words scattered about the history rather than the last 5. And this is always evolving and differs for each new word.

Attention solves this problem. Instead of always looking at the last 5 words, or last N words, we have a dynamically varying "score" for how relevant each of the previous words is, given the current one we want to predict. This idea is closer to the way humans parse real language. A Markov model can be thought of as a very primitive version of this, where we always just attend evenly to the last N tokens and ignore everything else.
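To make the lookup-table idea concrete, here is a minimal sketch of an N-gram table in Python. The function names (`build_ngram_table`, `next_token_distribution`) and the toy corpus are made up for illustration; real N-gram models also add smoothing and backoff, which this leaves out.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n):
    """Map each run of n consecutive tokens to counts of the token that followed it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        table[context][tokens[i + n]] += 1
    return table


def next_token_distribution(table, context):
    """Turn the stored counts for one context into probabilities."""
    counts = table.get(tuple(context))
    if not counts:
        return {}  # context never seen: the table has nothing to say
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat and the cat ran off".split()
table = build_ngram_table(corpus, n=2)
dist = next_token_distribution(table, ["the", "cat"])
unseen = next_token_distribution(table, ["mat", "ran"])
```

The table only ever sees the last `n` tokens, and a context that never occurred in the data gets an empty answer. With a vocabulary of V tokens there are up to V**n possible contexts, which is why simply raising `n` does not scale.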
So you can think of attention as kind of like an infinite-order Markov chain, but with variable weights representing how important each past token is, and with those weights always dynamically adjusting as the text stream goes on.

The other difference is that we can no longer have a simple lookup table like we do with N-gram Markov models. Instead, we need to somehow build some complex function that takes in the previous context and outputs the correct next-token distribution. We cannot just store the distribution of tokens given every possible combination of previous ones (and with variable weights on top of it!), as there are infinitely many. It's kind of like we need to "compress" the hypothetically exponentially large lookup table into some kind of simple expression that lets us compute what the lookup table would say without having to store every possible output at once.

Both of these things - computing attention scores, and figuring out some formula for the next-token distribution - are currently solved with deep networks that just learn from data, running gradient descent until they magically start giving good results. But if the network isn't powerful enough, it won't give good results - maybe comparable to a more primitive N-gram model. So that's why you see what you are seeing.
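The contrast between "attend evenly to the last N tokens" and a dynamically varying score can be sketched roughly as follows. This is an illustration only, assuming scaled dot-product scoring with a softmax; the vectors are invented, and real transformers use learned query/key projections, multiple heads, and positional information that are omitted here. The helper names are hypothetical.

```python
import math


def softmax(scores):
    """Normalise raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product score of the query against each earlier position."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)


def last_n_weights(num_positions, n):
    """The N-gram view in the comment's analogy: equal weight on the last n positions."""
    n = min(n, num_positions)
    return [0.0] * (num_positions - n) + [1.0 / n] * n


# Made-up key vectors for four earlier tokens, and a query for the current one.
keys = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]

attn = attention_weights(query, keys)   # most weight lands on the first, distant token
fixed = last_n_weights(len(keys), n=2)  # always the last two tokens, whatever they are
```

Here the query happens to line up with the first key, so the attention weights concentrate on the earliest position, while the fixed window cannot see it at all. In a trained model, the vectors that produce these scores come out of the network itself, not from anything written by hand.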