dragonwriter 2 days ago
I think, strictly speaking, autoregressive LLMs are Markov chains of a very high order. The trick (aside from the order) is the training process by which they are derived from their source data. Simply enumerating the states and transitions in the source data, along with the probability of each transition from each state, doesn't get you an LLM.
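For concreteness, a minimal sketch of that enumeration approach: an order-n model built purely by counting transitions in the source tokens and sampling from the resulting frequencies. The code and names (e.g. build_markov_model) are illustrative, not taken from anyone's actual implementation.

    # Order-n Markov chain from raw counts: state -> next-token frequencies.
    import random
    from collections import defaultdict, Counter

    def build_markov_model(tokens, order=2):
        transitions = defaultdict(Counter)
        for i in range(len(tokens) - order):
            state = tuple(tokens[i:i + order])
            transitions[state][tokens[i + order]] += 1
        return transitions

    def sample_next(transitions, state):
        counts = transitions[state]
        # Transition probability is just count / total, sampled directly.
        return random.choices(list(counts), weights=list(counts.values()))[0]

    tokens = "the cat sat on the mat and the cat slept".split()
    model = build_markov_model(tokens, order=2)
    print(sample_next(model, ("the", "cat")))  # 'sat' or 'slept'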
krackers a day ago | parent | next
I always like to think of LLMs as Markov models in the way that real-world computers are finite state machines: it's technically true, but not a useful level of abstraction at which to analyze them. Both LLMs and n-gram models satisfy the Markov property, and you could in principle go through and compute explicit transition matrices (something like vocab_size^context_size states, each with vocab_size outgoing probabilities, I think). But LLMs aren't trained as n-gram models, so beyond giving you the autoregressive structure, there's not much you learn by viewing one as a Markov model.
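A rough back-of-the-envelope for why that explicit transition table is only notionally computable; the vocabulary and context sizes below are assumed, GPT-2-scale numbers, not figures from the thread.

    # State count for a Markov chain whose order equals the context length.
    import math

    vocab_size = 50257     # assumed vocabulary size
    context_size = 1024    # assumed context length (Markov order)

    # Every possible context window is a distinct state.
    log10_states = context_size * math.log10(vocab_size)
    print(f"states ~ 10^{log10_states:.0f}")          # ~ 10^4814

    # Each state carries vocab_size outgoing transition probabilities.
    log10_entries = log10_states + math.log10(vocab_size)
    print(f"table entries ~ 10^{log10_entries:.0f}")  # ~ 10^4819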
| |||||||||||||||||
JPLeRouzic a day ago | parent | prev
Yes, I agree: my code includes a good tokenizer, not a simple word splitter.
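To illustrate the difference, a small sketch with tiktoken's GPT-2 encoding standing in for whatever tokenizer the project actually uses:

    # Naive word splitting vs. subword (BPE) tokenization.
    # tiktoken's GPT-2 encoding is used purely as an illustrative stand-in.
    import tiktoken

    text = "Tokenization isn't just whitespace splitting!"

    # Word splitter: punctuation sticks to words, rare words stay opaque.
    print(text.split())

    # Subword tokenizer: breaks text into reusable pieces with integer ids.
    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode(text)
    print([enc.decode([i]) for i in ids])  # e.g. 'Token', 'ization', ' isn', "'t", ...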