ssivark 2 days ago

The Markov property means that the distribution of the next token depends only on the current token. Well, if it were a Hidden Markov Model, it would actually be the next hidden state that depends only on the current state, and the emitted tokens would be a lossy representation of those states.

The problem with HMMs is that the sequence model (the Markov transition matrix) accounts for much less context than even Tiny LLMs. One natural way to improve this is to allow the model more hidden states representing more context -- called "clones" because several different hidden states all emit the same token while carrying different underlying contexts that might be relevant for future tokens. We are thus taking a process that is non-Markov over tokens (the kind a transformer models directly) and re-framing its representation to be Markov over hidden states. There are sequence models built on this idea, known as Cloned HMMs (CHMMs) [1] or Clone-Structured Cognitive Graphs (CSCGs) [2]. The latter name is inspired by related work in neuroscience, where these models were applied and shown to map nicely onto "cognitive schemas" and to be particularly effective at discovering interpretable models of spatial structure.
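To make the "clones" idea concrete, here is a minimal sketch (my own toy illustration, not code from [1] or [2]; all names and parameters are made up): each token in a small vocabulary gets a fixed number of clone states, emission is deterministic, and whatever context the model carries lives entirely in *which* clone the chain is currently in.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5        # number of distinct tokens (toy value)
n_clones = 4          # hidden "clone" states per token (toy value)
n_states = vocab_size * n_clones

# Random row-stochastic transition matrix over hidden states. Learning would
# sparsify this so that different clones of the same token lead to different futures.
T = rng.random((n_states, n_states))
T /= T.sum(axis=1, keepdims=True)

def emit(state):
    # Deterministic emission: state s is a clone of token s // n_clones,
    # so clones are indistinguishable in the output stream.
    return state // n_clones

def sample(length, start_state=0):
    # The token sequence can look non-Markov, but the hidden clone sequence is Markov.
    state, tokens = start_state, []
    for _ in range(length):
        tokens.append(emit(state))
        state = rng.choice(n_states, p=T[state])
    return tokens

print(sample(20))
```

The design choice that matters is the deterministic, block-structured emission: it keeps the model interpretable (every hidden state "is" a token in some context) while letting the number of clones per token control how much context can be represented.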

I did some unpublished work a couple of years ago (while at Google DeepMind) studying how CHMMs scale to simple ~GB-sized language datasets like TinyStories [3]. As a subjective opinion, while they're not as good as small transformers, they do generate text that is surprisingly good compared with naive expectations of Markov models. The challenge is that the learning algorithms we typically use for HMMs (e.g. Expectation Maximization) are somewhat hard to optimize and scale on contemporary AI hardware (GPUs/TPUs), whereas a transformer trained by gradient descent with lots of compute works pretty well and also scales to larger datasets and model sizes.
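For a sense of why EM is awkward on accelerators, here is a hedged sketch of one Baum-Welch update for the toy CHMM above (again my own illustration; em_step and its arguments are assumed names, not an actual CHMM codebase). The forward and backward recursions are inherently sequential over the token stream, unlike the batched gradient steps used to train a transformer.

```python
import numpy as np

def em_step(T, tokens, n_clones):
    # One Baum-Welch (EM) update for a cloned HMM with deterministic emissions.
    # The forward/backward recursions below are sequential in t, which is the
    # part that maps poorly onto GPU/TPU-style parallelism.
    n_states, L = T.shape[0], len(tokens)

    # "Emission likelihood" is just an indicator that the hidden state is a
    # clone of the observed token.
    mask = np.zeros((L, n_states))
    for t, tok in enumerate(tokens):
        mask[t, tok * n_clones:(tok + 1) * n_clones] = 1.0

    # Forward pass (normalized at each step for numerical stability).
    alpha = np.zeros((L, n_states))
    alpha[0] = mask[0] / mask[0].sum()
    for t in range(1, L):
        a = (alpha[t - 1] @ T) * mask[t]
        alpha[t] = a / a.sum()

    # Backward pass.
    beta = np.ones((L, n_states))
    for t in range(L - 2, -1, -1):
        b = T @ (beta[t + 1] * mask[t + 1])
        beta[t] = b / b.sum()

    # E-step: expected transition counts; M-step: renormalize rows.
    counts = np.zeros_like(T)
    for t in range(L - 1):
        xi = np.outer(alpha[t], beta[t + 1] * mask[t + 1]) * T
        counts += xi / xi.sum()
    T_new = counts + 1e-6   # small pseudocount to avoid zero rows
    return T_new / T_new.sum(axis=1, keepdims=True)
```

In practice you would run em_step repeatedly over (batches of) sequences until the transition matrix stops changing; the point of the sketch is just that each update needs two full sequential sweeps over every sequence.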

I later switched to working on other things, but I still sometimes wonder whether it might be possible to cook up better learning algorithms that attack the problem of disambiguating contexts during the learning phase. The advantage of an explicit/structured graphical model like a CHMM is that it is very interpretable and allows for extremely flexible queries at inference time -- unlike transformers (or other sequence models), which are trained as "policies" for generating token streams.
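As an example of the kind of flexible query I mean, here is a sketch (my own, reusing the toy model and conventions from the snippets above) of filling in a missing token in the middle of a sequence by conditioning on both the past and the future via forward-backward smoothing -- something a left-to-right generative "policy" doesn't answer directly.

```python
import numpy as np

def infer_missing_token(T, tokens, missing_idx, n_clones, vocab_size):
    # Posterior over the vocabulary at position `missing_idx`, conditioned on
    # both the tokens before it and the tokens after it, read off directly
    # from the graphical model. The value of tokens[missing_idx] is ignored.
    n_states, L = T.shape[0], len(tokens)

    mask = np.ones((L, n_states))
    for t, tok in enumerate(tokens):
        if t != missing_idx:                      # the gap stays unconstrained
            mask[t] = 0.0
            mask[t, tok * n_clones:(tok + 1) * n_clones] = 1.0

    alpha = mask[0] / mask[0].sum()               # filter forward up to the gap
    for t in range(1, missing_idx + 1):
        alpha = (alpha @ T) * mask[t]
        alpha /= alpha.sum()

    beta = np.ones(n_states)                      # smooth backward from the end
    for t in range(L - 2, missing_idx - 1, -1):
        beta = T @ (beta * mask[t + 1])
        beta /= beta.sum()

    gamma = alpha * beta                          # posterior over clones at the gap
    gamma /= gamma.sum()
    # Marginalize over the clones of each token to get a distribution over tokens.
    return gamma.reshape(vocab_size, n_clones).sum(axis=1)
```

The same machinery answers other conditional queries (e.g. "what futures are consistent with this prefix and that constraint?") without retraining, which is what I mean by the model being more than a generation policy.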

When I say that transformers don't allow flexible querying, I'm glossing over their in-context learning capabilities, since we still lack a clear/complete understanding of what kinds of pre-training and fine-tuning are needed to elicit them (these are frontier research questions at the moment, and they require a more nuanced discussion than a quick HN comment).

It turns out, funnily, that these properties of CHMMs actually proved very useful [4] for understanding the conceptual underpinnings of in-context learning behavior using simple Markov sequence models instead of "high-powered" transformers. Some recent work from OpenAI [5] on sparse, interpretable transformer models seems to suggest that in-context learning in transformer LLMs might work analogously, by learning schema circuits. So the fact that we can learn similar schema circuits with CHMMs makes me believe that what we have is a learning challenge, not a fundamental representational incapacity (as is sometimes loosely claimed). In the spirit of full disclosure, I worked on [4]; if you want a rapid summary of all the ideas in this comment, including a quick introduction to CHMMs, I would recommend the following video presentation / slides [6].

[1]: https://arxiv.org/abs/1905.00507

[2]: https://www.nature.com/articles/s41467-021-22559-5

[3]: https://arxiv.org/abs/2305.07759

[4]: https://arxiv.org/abs/2307.01201

[5]: https://openai.com/index/understanding-neural-networks-throu...

[6]: https://slideslive.com/39010747/schemalearning-and-rebinding...