yencabulator 11 days ago
A tokenizer roughly Huffman-codes sequences of input (bytes of English text, etc.) into shorter sequences (lists of tokens), as a performance optimization. As you said, it's not in any way intrinsic to the LLM, though it may be a very necessary optimization on today's hardware.
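To make the "shorter sequences" point concrete, here is a minimal sketch of a toy tokenizer. It uses byte-pair merging (the approach many LLM tokenizers are built on) rather than literal Huffman coding, and every name in it (`train_merges`, `apply_merge`, `encode`, `decode`) is made up for illustration and not taken from any real library. Treat it as a rough demonstration under those assumptions, not as how any production tokenizer is implemented.

```python
from collections import Counter


def apply_merge(seq: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_merges(data: bytes, num_merges: int) -> list[tuple[int, int]]:
    """Learn up to `num_merges` pair merges from `data` (toy byte-pair encoding)."""
    seq = list(data)
    merges = []
    next_id = 256  # ids 0..255 are reserved for raw bytes
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not help
        merges.append(pair)
        seq = apply_merge(seq, pair, next_id)
        next_id += 1
    return merges


def encode(data: bytes, merges: list[tuple[int, int]]) -> list[int]:
    """Turn bytes into token ids by replaying the learned merges in order."""
    seq = list(data)
    for k, pair in enumerate(merges):
        seq = apply_merge(seq, pair, 256 + k)
    return seq


def decode(tokens: list[int], merges: list[tuple[int, int]]) -> bytes:
    """Expand token ids back into the original bytes."""
    table = {256 + k: pair for k, pair in enumerate(merges)}
    stack = list(reversed(tokens))
    out = bytearray()
    while stack:
        t = stack.pop()
        if t < 256:
            out.append(t)
        else:
            left, right = table[t]
            stack.append(right)  # pushed first so that `left` is popped first
            stack.append(left)
    return bytes(out)


if __name__ == "__main__":
    text = b"the cat sat on the mat and the rat sat on the hat "
    merges = train_merges(text, 20)
    tokens = encode(text, merges)
    print(len(text), "bytes ->", len(tokens), "tokens")
```

The model downstream only ever sees the integer ids, so feeding it raw bytes (an empty merge list) is the same interface with longer sequences, which is the sense in which the tokenizer is an optimization rather than something intrinsic.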
phire 11 days ago
I wouldn't use the word "necessary". IMO, we are probably talking about a 6x slowdown (for typical English), so you would need to be absolutely stupid not to implement some kind of optimisation along these lines. Without one, the model would be slower and maybe a little dumber, but it would work.