JPLeRouzic | a day ago
Thanks for your kind words. My code is not really novel, but it is not like the simplistic Markov chain text generators that are found by the ton on the web. I will keep improving it and publish it on my GitHub account when I am satisfied with it.

It started as a Simple Language Model [0]; SLMs differ from ordinary Markov generators by incorporating a crude prompt mechanism and a very basic attention-like mechanism called history. My SLM uses Prediction by Partial Matching (PPM). The one in the link is character-based and very simple, whereas mine works on tokens and is about 1,300 lines of C.

The tokenizer tracks the ends of sentences and paragraphs. I didn't use subword ("part-of-a-word") tokenization as LLMs do, but it would be trivial to add. Tokens are represented by a number (again, as in LLMs), not by a character string. I use hash tables for the model. There are several fallback mechanisms for when the next-state function fails; one of them uses the prompt, though it is not demonstrated here. Several other implemented mechanisms, such as model pruning and skip-grams, are not demonstrated here either. (Two rough sketches, one of the tokenizer and one of the hash-table model with its fallback, are appended at the end of this comment.)

I am trying to improve this Markov text generator, and some of the tips in the comments will be of great help. But my point is not to make an LLM; it's that LLMs produce good results not because of their supposedly advanced algorithms, but because of two things:

- There is an enormous amount of engineering in LLMs, whereas there is usually almost none in Markov text generators, so people get the impression that Markov text generators are toys.

- LLMs are only possible because they exploit the impressive hardware improvements of the last decades. My text generator uses only 5 MB of RAM when running this example! But as commenters noted, the size of the model explodes quickly, and that is a point I should improve in my code.

And indeed, even small LLMs like NanoGPT are unable to produce results as good as my text generator's with only 42 KB of training text.

https://github.com/JPLeRouzic/Small-language-model-with-comp...
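
To make the token and boundary-tracking idea concrete, here is a simplified sketch of a word-level tokenizer: words are mapped to integer IDs through a hash table, and special IDs are emitted at sentence and paragraph boundaries. This is only an illustration of the principle, not the code from my repository; every identifier and constant here is made up for the example.

  /*
   * Simplified tokenizer sketch (not my actual code): map each word to
   * an integer ID via a hash table, and emit special IDs at sentence
   * and paragraph boundaries.
   */
  #include <ctype.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define VOCAB_BUCKETS 4096
  #define MAX_WORD      64

  enum { TOK_EOS = 0, TOK_EOP = 1, TOK_FIRST_WORD = 2 };

  typedef struct VocabEntry {
      char word[MAX_WORD];
      int  id;
      struct VocabEntry *next;
  } VocabEntry;

  static VocabEntry *vocab[VOCAB_BUCKETS];
  static int next_id = TOK_FIRST_WORD;

  static unsigned hash_str(const char *s)
  {
      unsigned h = 5381;
      while (*s)
          h = h * 33 + (unsigned char)*s++;
      return h % VOCAB_BUCKETS;
  }

  /* Return the integer ID of a word, creating a vocabulary entry if needed. */
  static int vocab_lookup(const char *word)
  {
      unsigned h = hash_str(word);
      for (VocabEntry *e = vocab[h]; e; e = e->next)
          if (strcmp(e->word, word) == 0)
              return e->id;
      VocabEntry *e = calloc(1, sizeof *e);
      strncpy(e->word, word, MAX_WORD - 1);
      e->id = next_id++;
      e->next = vocab[h];
      vocab[h] = e;
      return e->id;
  }

  /* Print one integer ID per word, plus TOK_EOS after '.', '!' or '?'
   * and TOK_EOP after a blank line. */
  static void tokenize(const char *text)
  {
      char word[MAX_WORD];
      int len = 0, newlines = 0;
      for (const char *p = text; ; p++) {
          int c = (unsigned char)*p;
          if (isalnum(c) || c == '\'') {           /* inside a word */
              if (len < MAX_WORD - 1)
                  word[len++] = (char)tolower(c);
              newlines = 0;
              continue;
          }
          if (len > 0) {                           /* flush the pending word */
              word[len] = '\0';
              printf("%d ", vocab_lookup(word));
              len = 0;
          }
          if (c == '.' || c == '!' || c == '?')
              printf("%d ", TOK_EOS);              /* sentence boundary */
          if (c == '\n' && ++newlines == 2) {
              printf("%d ", TOK_EOP);              /* blank line: paragraph boundary */
              newlines = 0;
          }
          if (c == '\0')
              break;
      }
      printf("\n");
  }

  int main(void)
  {
      tokenize("The cat sat. The cat slept.\n\nA new paragraph starts here.");
      return 0;
  }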
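
And here is a similarly simplified sketch of the other half: the model as a hash table from (context of token IDs, next token) to counts, with a PPM-style fallback that backs off from the longest context to shorter ones and, as a last resort, picks a token from the prompt. Again, this illustrates the principle only; the names, constants, and toy data are invented for the example and are not my implementation.

  /*
   * Simplified model sketch (not my actual code): a token-level Markov
   * model stored in a hash table keyed by a context of up to ORDER
   * previous token IDs, with PPM-style backoff and a prompt fallback.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  #define ORDER   2                     /* longest context length */
  #define BUCKETS 8192                  /* hash-table size        */

  typedef struct Edge {
      int ctx[ORDER];                   /* context token IDs           */
      int order;                        /* how many of ctx[] are valid */
      int next;                         /* observed next token         */
      int count;                        /* observation count           */
      struct Edge *link;
  } Edge;

  static Edge *table[BUCKETS];

  static unsigned hash_ctx(const int *ctx, int order)
  {
      unsigned h = 2166136261u;         /* FNV-1a over the context */
      for (int i = 0; i < order; i++)
          h = (h ^ (unsigned)ctx[i]) * 16777619u;
      return (h ^ (unsigned)order) % BUCKETS;
  }

  /* Record one observation: `next` followed the given context. */
  static void add_edge(const int *ctx, int order, int next)
  {
      Edge *e = calloc(1, sizeof *e);
      memcpy(e->ctx, ctx, order * sizeof(int));
      e->order = order;
      e->next  = next;
      e->count = 1;
      unsigned h = hash_ctx(ctx, order);
      e->link  = table[h];
      table[h] = e;
  }

  /* Train: register edges for every context length up to ORDER. */
  static void train(const int *tok, int n)
  {
      for (int i = 1; i < n; i++)
          for (int order = 1; order <= ORDER && order <= i; order++)
              add_edge(&tok[i - order], order, tok[i]);
  }

  /* Sample a next token for this exact context; -1 if it was never seen. */
  static int sample(const int *ctx, int order)
  {
      unsigned h = hash_ctx(ctx, order);
      int total = 0;
      for (Edge *e = table[h]; e; e = e->link)
          if (e->order == order && memcmp(e->ctx, ctx, order * sizeof(int)) == 0)
              total += e->count;
      if (total == 0)
          return -1;
      int r = rand() % total;
      for (Edge *e = table[h]; e; e = e->link)
          if (e->order == order && memcmp(e->ctx, ctx, order * sizeof(int)) == 0)
              if ((r -= e->count) < 0)
                  return e->next;
      return -1;
  }

  /* PPM-style generation step: longest context first, then back off,
   * then fall back to a random token from the prompt. */
  static int next_token(const int *hist, int len, const int *prompt, int plen)
  {
      for (int order = ORDER; order >= 1; order--)
          if (len >= order) {
              int t = sample(&hist[len - order], order);
              if (t >= 0)
                  return t;
          }
      return prompt[rand() % plen];
  }

  int main(void)
  {
      int corpus[] = { 5, 7, 9, 5, 7, 11, 5, 7, 9, 13 };   /* toy token IDs */
      int prompt[] = { 5, 7 };
      int out[16]  = { 5, 7 };
      srand((unsigned)time(NULL));
      train(corpus, (int)(sizeof corpus / sizeof *corpus));
      for (int i = 2; i < 16; i++)
          out[i] = next_token(out, i, prompt, 2);
      for (int i = 0; i < 16; i++)
          printf("%d ", out[i]);
      printf("\n");
      return 0;
  }

In the real thing the fallback chain is richer (prompt matching, skip-grams, pruning), but the backoff loop above is the core idea: only when every stored context fails does the generator reach for the prompt.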