pama 2 days ago

LLMs have the ability to learn certain classes of algorithms from their datasets in order to reduce errors when compressing their pretraining data. If you are technically inclined, read the reference: https://arxiv.org/abs/2208.01066 (and optionally the follow-up work) to see how LLMs can pick up complicated algorithms from training on examples that could have been generated by such algorithms (in one of the cases the LLM does better than any algorithm we know; in the rest it simply matches our best algorithms). Learning such functions from data would not work with Markov chains at any level of training. The LLMs in this study are tiny. They are not really learning a language, but rather how to perform regression.
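
For a concrete sense of that setup, here is a rough sketch of the in-context regression task from that paper (the dimensions, seed, and variable names are my own; the point is that the prompt is a sequence of (x, y) pairs followed by a query x, and the reference "best algorithm" for the linear case is ordinary least squares):

    # Sketch of the in-context regression setup from arXiv:2208.01066,
    # reconstructed from the paper's description; details are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_context = 8, 20          # input dimension, number of in-context examples

    # Each "task" is a fresh weight vector; the model never sees it directly.
    w = rng.normal(size=d)
    xs = rng.normal(size=(n_context + 1, d))
    ys = xs @ w                   # noiseless linear targets

    # The transformer's prompt: (x_1, y_1, ..., x_n, y_n, x_query); it must predict y_query.
    prompt_xs, prompt_ys, x_query = xs[:-1], ys[:-1], xs[-1]

    # The "best algorithm we know" baseline for this function class: ordinary least squares.
    w_hat, *_ = np.linalg.lstsq(prompt_xs, prompt_ys, rcond=None)
    y_pred_ols = x_query @ w_hat
    print(abs(y_pred_ols - ys[-1]))   # ~0 once n_context >= d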

thesz a day ago | parent

Transformers perform a (soft, continuous) beam search internally, with a beam width no larger than the number of key-value pairs in the attention mechanism.

In my experience, equipping a Markov chain with beam search greatly improves its predictive power, even if the Markov chain is just a heavily pruned ARPA 3-gram model.
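
For anyone who hasn't combined the two, a minimal sketch of beam search over a trigram Markov model looks something like this (the toy trigram table and beam width are made up for illustration; a real setup would load an ARPA file with backoff weights):

    # Minimal beam-search decoder over a trigram Markov model (a sketch).
    import math, heapq

    trigrams = {  # P(next | prev2, prev1), toy probabilities
        ("<s>", "the"): {"cat": 0.6, "dog": 0.4},
        ("the", "cat"): {"sat": 0.7, "ran": 0.3},
        ("the", "dog"): {"ran": 0.8, "sat": 0.2},
        ("cat", "sat"): {"</s>": 1.0},
        ("cat", "ran"): {"</s>": 1.0},
        ("dog", "sat"): {"</s>": 1.0},
        ("dog", "ran"): {"</s>": 1.0},
    }

    def beam_search(prefix, beam_width=3, max_len=10):
        # Each hypothesis is (log_prob, tokens); beam_width=1 is greedy decoding.
        beam = [(0.0, list(prefix))]
        for _ in range(max_len):
            candidates = []
            for logp, toks in beam:
                if toks[-1] == "</s>":
                    candidates.append((logp, toks))  # keep finished hypotheses
                    continue
                for nxt, p in trigrams.get(tuple(toks[-2:]), {}).items():
                    candidates.append((logp + math.log(p), toks + [nxt]))
            beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
            if all(toks[-1] == "</s>" for _, toks in beam):
                break
        return beam

    print(beam_search(["<s>", "the"]))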

What is more, Markov chains are not restricted to immediate prefixes; you can use skip-grams as well. How to use them and how to mix them into a single list of probabilities is shown in the paper on Sparse Non-negative Matrix Language Modeling [1].

[1] https://aclanthology.org/Q16-1024/
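
A much-simplified sketch of the mixing idea, using fixed interpolation weights rather than the learned per-feature weights the SNM paper actually uses:

    # Mixing ordinary n-gram and skip-gram predictions into one distribution
    # by linear interpolation (a simplification of the SNM approach in [1]).
    from collections import defaultdict

    def mix_predictions(context, models, weights):
        """models: list of (key_extractor, table) pairs, where table maps a
        context key to {word: prob}; each model extracts its own key from the
        context (e.g. the last two words, or a skip-gram with a gap)."""
        mixed = defaultdict(float)
        for (extract_key, table), lam in zip(models, weights):
            for w, p in table.get(extract_key(context), {}).items():
                mixed[w] += lam * p
        return dict(mixed)

    # Toy tables: a trigram model and a skip-gram model that skips the word
    # immediately before the prediction point.
    trigram  = {("quick", "brown"): {"fox": 0.9, "dog": 0.1}}
    skipgram = {("quick", "*"):     {"fox": 0.5, "rabbit": 0.5}}

    models  = [(lambda c: tuple(c[-2:]), trigram),
               (lambda c: (c[-2], "*"),  skipgram)]
    weights = [0.7, 0.3]   # fixed interpolation weights, for illustration only

    print(mix_predictions(["the", "quick", "brown"], models, weights))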

I think I should look into that link of yours later. Having skimmed it, I should say it... smells interesting in some places. For one example, decision-tree learning is performed with a greedy algorithm which, I believe, does not use oblique splits, whereas transformers inherently learn oblique splits.
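
To make the distinction concrete (my own toy example, not from either paper): an axis-aligned split tests one coordinate against a threshold, while an oblique split thresholds a linear combination of coordinates, which is the kind of boundary a single linear map can express:

    # Axis-aligned vs. oblique splits (illustrative values only).
    import numpy as np

    x = np.array([0.4, 0.9, 0.1])

    # Axis-aligned split, as produced by greedy learners like CART:
    # compare one feature to a threshold.
    axis_aligned = x[1] <= 0.5

    # Oblique split: compare a linear combination of features to a threshold.
    w, t = np.array([0.2, -0.7, 1.1]), 0.0
    oblique = w @ x <= t

    print(axis_aligned, oblique)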