kleiba 2 days ago
Markov chains of order n are essentially n-gram models, and that is what language models were for a very long time. They are quite good. In fact, they were so good that more sophisticated models often couldn't beat them.

But then came deep-learning models, think transformers. Here you don't represent your inputs and states discretely; instead you have a representation in a higher-dimensional space that aims to preserve some sort of "semantics": proximity in that space means proximity in meaning. This makes it possible to capture nuances much more finely than with sequences of symbols from a discrete set.

Take this example: you're given a sequence of n words and have to predict a good word to follow that sequence. That's the thing that LMs do. Now, if you're an n-gram model and have never seen that sequence in training, what are you going to predict? You have no data in your probability tables. So what you do is smoothing: you take away some of the probability mass that you assigned during training to the samples you encountered and give it to samples you have not seen. How? That's the secret sauce, and there are multiple approaches.

With NN-based LLMs you don't have that exact same issue: even if you have never seen that n-word sequence in training, it still gets mapped into your high-dimensional space, and from there you get a distribution telling you which words are good follow-ups. If you have seen sequences of similar meaning (even with different words) in training, these will probably be better predictions. For n-grams, by contrast, having seen sequences of similar meaning but with different words during training doesn't really help you all that much.
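To make the smoothing idea concrete, here is a minimal sketch of a bigram model (an order-1 Markov chain) with add-k smoothing, one of the simpler of those "multiple approaches"; the class, the toy corpus, and the value of k are made up for illustration:

```python
from collections import Counter, defaultdict

class BigramLM:
    """A bigram (order-1 Markov) language model with add-k smoothing."""

    def __init__(self, k=0.5):
        self.k = k
        self.bigrams = defaultdict(Counter)  # context word -> Counter of next words
        self.vocab = set()

    def train(self, tokens):
        for prev, nxt in zip(tokens, tokens[1:]):
            self.bigrams[prev][nxt] += 1
            self.vocab.update((prev, nxt))

    def prob(self, prev, nxt):
        # Add-k smoothing: every (prev, nxt) pair gets k pseudo-counts, so
        # unseen continuations still receive a small slice of probability mass.
        counts = self.bigrams[prev]
        return (counts[nxt] + self.k) / (sum(counts.values()) + self.k * len(self.vocab))

lm = BigramLM()
lm.train("the cat sat on the mat".split())
print(lm.prob("the", "cat"))  # seen bigram: relatively high (~0.33)
print(lm.prob("the", "sat"))  # unseen bigram: small but non-zero (~0.11)
```

The key point is the denominator: every possible continuation, seen or not, contributes k pseudo-counts, which is exactly the "taking away probability mass from seen samples" described above.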
ActorNightly 2 days ago
In theory, you could have a large enough Markov chain that mimics an LLM; you would just need it to be exponentially larger in width. After all, it's just matrix multiplies start to finish. A lot of the other data operations (like normalization) can be represented as matrix multiplies, just less efficiently, in the same way that a transformer can be represented, inefficiently, as a set of fully connected deep layers.
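As a heavily simplified illustration of the "it's just matrix multiplies" point: a first-order Markov chain is literally a one-hot state vector times a transition matrix. The vocabulary and probabilities below are invented toy values; a chain that actually mimicked an LLM would need one row per possible context, hence the exponential blow-up in width.

```python
import numpy as np

vocab = ["the", "cat", "sat"]
T = np.array([
    [0.1, 0.6, 0.3],   # P(next | "the")
    [0.2, 0.1, 0.7],   # P(next | "cat")
    [0.5, 0.4, 0.1],   # P(next | "sat")
])

state = np.zeros(len(vocab))
state[vocab.index("the")] = 1.0   # one-hot encoding of the current token
next_dist = state @ T             # one matrix multiply yields the next-token distribution
print(dict(zip(vocab, next_dist)))
```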
andai a day ago
> just because you have seen sequences of similar meaning (but with different words) during training, that doesn't really help you all that much.

Sounds solvable with synonyms? The same way keyword search is brittle but does much better when you add keyword expansion. Probably the arbitrariness of grammar would nuke performance here; you'd want to normalize the sentence structure too. Hmm...
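A rough sketch of the synonym-expansion idea: when an exact n-gram lookup misses, retry with synonym substitutions. The synonym table, counts, and function name here are all invented for the example.

```python
# Hypothetical illustration: fall back to synonym-substituted lookups
# when the exact bigram was never seen in training.
SYNONYMS = {"couch": ["sofa"], "sofa": ["couch"]}
BIGRAM_COUNTS = {("the", "sofa"): 12}

def lookup_with_expansion(prev, nxt):
    if (prev, nxt) in BIGRAM_COUNTS:
        return BIGRAM_COUNTS[(prev, nxt)]
    # try swapping the unseen word for a synonym
    for alt in SYNONYMS.get(nxt, []):
        if (prev, alt) in BIGRAM_COUNTS:
            return BIGRAM_COUNTS[(prev, alt)]
    return 0

print(lookup_with_expansion("the", "couch"))  # 12, borrowed from ("the", "sofa")
```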
wartywhoa23 a day ago
> So what you do is smoothing: you take away some of the probability mass that you have assigned during training to the samples you encountered and give it to samples you have not seen.

And then you can build a trillion-dollar industry selling hallucinations.
sadid a day ago
Yes, but on the n-gram vs. transformers point: if you consider a more general paradigm, the self-attention mechanism is basically a special form of a graph neural network [1].

[1] Bridging Graph Neural Networks and Large Language Models: A Survey and Unified Perspective https://infoscience.epfl.ch/server/api/core/bitstreams/7e6f8...
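One way to see the correspondence: single-head self-attention can be written as message passing on a fully connected graph, with the softmaxed query-key scores acting as edge weights. A minimal numpy sketch, with arbitrary shapes and random weights standing in for learned parameters:

```python
import numpy as np

def self_attention_as_message_passing(X, Wq, Wk, Wv):
    """Single-head self-attention viewed as message passing on a complete graph:
    each token (node) aggregates messages (values) from every token, weighted
    by softmaxed query/key similarity (the edge weights)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # edge scores of the complete graph
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over incoming edges
    return weights @ V                                 # aggregate neighbour messages

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                            # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention_as_message_passing(X, Wq, Wk, Wv).shape)  # (4, 8)
```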
esafak 2 days ago