pornel 5 days ago

You can have "cat" as 1 token, or you can have "c" "a" "t" as 3 tokens.

In either case, the tokens are a necessary part of LLMs. They have to have a differentiable representation in order to be trainable effectively. High-dimensional embeddings are differentiable and can usefully represent the "meaning" of a token.

In other words, the representation of "cat" in an LLM must be something that can be gradually nudged towards "kitten", or "print", or "excavator", or other possible meanings. This is doable with a large vector representation, but such an operation makes no sense if you try to represent the meaning directly in ASCII.
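
For illustration, a minimal PyTorch sketch (the toy vocabulary, dimensions, and loss are made up) of why the vector, not the ASCII, is the part that can be trained:

    import torch
    import torch.nn as nn

    # Toy vocabulary: each token id indexes a row of a trainable embedding table.
    vocab = {"cat": 0, "kitten": 1, "print": 2, "excavator": 3}
    emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

    cat = emb(torch.tensor([vocab["cat"]]))        # shape (1, 8), differentiable
    kitten = emb(torch.tensor([vocab["kitten"]]))

    # A loss that pulls "cat" towards "kitten" yields a gradient that nudges
    # the *vector* for "cat"; the discrete id 0 itself never changes.
    loss = (cat - kitten.detach()).pow(2).sum()
    loss.backward()
    print(emb.weight.grad[vocab["cat"]])           # non-zero row: the part that learns

    # Raw ASCII bytes, by contrast, are fixed integers with no gradient:
    print(list("cat".encode("ascii")))             # [99, 97, 116]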

amelius 5 days ago | parent [-]

True, but imagine an input that is ASCII, followed by some NN layers that produce an embedded representation, and from there the usual NN layers of your LLM. The first layers can have shared weights (shared between inputs). Thus, let the LLM solve the embedding problem implicitly. Why wouldn't this work? It would be much more elegant, because the entire design would consist of neural networks, with no extra code or data preprocessing necessary.
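
A rough, purely hypothetical PyTorch sketch of such a learned character-level front end, where a byte embedding plus a strided convolution play the role of the shared first layers (all sizes and the pooling scheme are invented):

    import torch
    import torch.nn as nn

    class CharFrontEnd(nn.Module):
        """Shared-weight layers that turn raw bytes into 'soft token' embeddings."""
        def __init__(self, d_model=512, patch=4):
            super().__init__()
            self.byte_emb = nn.Embedding(256, d_model)   # one row per byte value
            # The same convolution is applied at every position (weight sharing);
            # stride=patch pools groups of characters into one vector.
            self.pool = nn.Conv1d(d_model, d_model, kernel_size=patch, stride=patch)

        def forward(self, text: str) -> torch.Tensor:
            ids = torch.tensor([list(text.encode("utf-8"))])   # (1, n_bytes)
            x = self.byte_emb(ids).transpose(1, 2)             # (1, d_model, n_bytes)
            return self.pool(x).transpose(1, 2)                # (1, n_bytes // patch, d_model)

    front = CharFrontEnd()
    soft_tokens = front("the cat sat on the mat!!")   # 24 bytes -> 6 vectors
    print(soft_tokens.shape)                          # torch.Size([1, 6, 512])
    # These vectors would then feed the usual transformer layers in place of
    # a lookup into a fixed token-embedding table.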

mathis 4 days ago | parent | next [-]

This might be more pure, but there is nothing to be gained. On the contrary, this would lead to very long sequences for which self-attention scales poorly.
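
Rough numbers, assuming an average of about 4 bytes per token (the exact ratio depends on the tokenizer and the text):

    # Self-attention cost is quadratic in sequence length.
    tokens = 2048
    bytes_ = tokens * 4

    attn_cost_tokens = tokens ** 2    #  4_194_304 pairwise scores per head/layer
    attn_cost_bytes = bytes_ ** 2     # 67_108_864

    print(attn_cost_bytes / attn_cost_tokens)   # 16.0: ~16x more attention work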

pornel 4 days ago | parent | prev [-]

The tokens are basically this, a result of precomputing and caching such layers.
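
In that view, a token embedding table is just a cache: run the character-level layers once per vocabulary entry and store the results. A toy sketch (the stand-in character encoder here, mean-pooled byte embeddings, is invented purely for illustration):

    import torch
    import torch.nn as nn

    # A stand-in for "some layers of NN over characters".
    byte_emb = nn.Embedding(256, 8)

    def encode_chars(s: str) -> torch.Tensor:
        ids = torch.tensor(list(s.encode("utf-8")))
        return byte_emb(ids).mean(dim=0)

    # "Tokenization" in this view: run the character layers once per vocabulary
    # entry and cache the results as rows of an embedding table.
    vocab = ["cat", "kitten", "print", "excavator"]
    with torch.no_grad():
        table = torch.stack([encode_chars(w) for w in vocab])   # (4, 8) cached rows

    # At run time, "cat" is looked up by id instead of being re-encoded:
    cat_id = vocab.index("cat")
    print(torch.allclose(table[cat_id], encode_chars("cat")))   # True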