thatjoeoverthr, 2 days ago:
Others have mentioned the large context window. That matters, but embeddings are just as important. Tokens in a classic Markov chain are discrete surrogate keys: "Love" and "love", for example, are two different tokens, as are "rage" and "fury". In a modern model, we start with an embedding model and build a LUT mapping token identities to vectors. This does two things for you. First, it solves the problem above, that "different" tokens can be conceptually similar: they're embedded in a space where they can be compared and contrasted along many dimensions, and the model becomes less sensitive to exact wording. Second, because the incoming context is now a tensor, it can be used with differentiable models, backpropagation, and so forth. I did something with this lately, actually, using a trained BERT model as a reranker for Markov chain emissions. It's rough, but it manages multi-turn conversation on a consumer GPU.
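A minimal sketch of the point about embeddings, using made-up 3-dimensional vectors (a real model learns these, e.g. word2vec or the input embedding table of BERT): discrete token IDs treat "Love" and "love" as unrelated keys, but in embedding space cosine similarity recovers the relationship.

```python
from math import sqrt

# Toy embedding LUT: token identity -> dense vector.
# The vectors here are invented for illustration only.
EMBED = {
    "Love":    [0.90, 0.80, 0.10],
    "love":    [0.88, 0.82, 0.12],
    "rage":    [-0.70, 0.60, 0.50],
    "fury":    [-0.68, 0.62, 0.48],
    "toaster": [0.10, -0.90, 0.70],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, < 0 means opposed."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# As surrogate keys these pairs are simply "different tokens";
# as vectors they are nearly identical directions.
print(cosine(EMBED["Love"], EMBED["love"]))    # near 1.0
print(cosine(EMBED["rage"], EMBED["fury"]))    # near 1.0
print(cosine(EMBED["love"], EMBED["toaster"])) # negative: unrelated
```

Because the lookup produces real-valued tensors rather than opaque IDs, everything downstream can be differentiable, which is the second point above.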
cestith, 20 hours ago:
The case sensitivity or insensitivity of a token is an implementation detail. I also haven't seen evidence that a Markov function definitionally can't use a lookup table of synonyms when predicting the next token.