ozgung 11 days ago
I see people give too much importance to specific engineering design choices of the current generation of LLMs. The tokenizer is not an essential part of the system. It's just an adapter for text input/output. It can be eliminated completely, and the model can operate on bytes directly. I think the short story captures this well. The weights (connections) are the essential and philosophically important part. They do the thinking, memory, singing, etc.
yencabulator 11 days ago
A tokenizer is, roughly speaking, Huffman-coding sequences of input (bytes of English, etc.) into shorter sequences (lists of tokens) as a performance optimization. As you said, it's not in any way intrinsic to the LLM, though it may be a very necessary optimization on today's hardware.
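To make the "shorter sequences" point concrete, here is a rough sketch in Python. It is a toy, not how any particular LLM's tokenizer is implemented: it uses greedy byte-pair merging rather than literal Huffman coding (the analogy is only that frequent sequences get their own short codes), and the function names `byte_tokens`, `learn_merges`, `apply_merge`, and `decode` are made up for illustration.

```python
from collections import Counter


def byte_tokens(text):
    """Tokenizer-free input: one token per UTF-8 byte (vocabulary of 256)."""
    return list(text.encode("utf-8"))


def apply_merge(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(tokens, num_merges):
    """Toy byte-pair merging: repeatedly fuse the most frequent adjacent pair.

    Returns the learned merges and the shortened token sequence.
    """
    tokens = list(tokens)
    merges = []
    next_id = 256  # ids 0..255 are reserved for raw bytes
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging no longer helps
        merges.append((pair, next_id))
        tokens = apply_merge(tokens, pair, next_id)
        next_id += 1
    return merges, tokens


def decode(tokens, merges):
    """Map tokens back to text, expanding merged ids into their bytes."""
    table = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges:
        table[new_id] = table[a] + table[b]
    return b"".join(table[t] for t in tokens).decode("utf-8")


if __name__ == "__main__":
    text = "the cat sat on the mat, the cat sat on the hat"
    raw = byte_tokens(text)
    merges, merged = learn_merges(raw, 20)
    print(len(raw), "byte tokens vs", len(merged), "merged tokens")
    print(decode(merged, merges) == text)
```

Both sequences carry the same information and decode to the same text. The merged one is just shorter, so a model has fewer positions to process, which is the optimization being described.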