pornel 5 days ago

This isn't about smarts, but about performance and memory usage.

Tokens are a form of compression, and working on uncompressed representation would require more memory and more processing power.

amelius 5 days ago | parent [-]

The opposite is true. ASCII and English are already a pretty compact encoding. I can say "cat" with just 24 bits. Your average LLM token embedding uses on the order of kilobits internally.

pornel 5 days ago | parent | next [-]

You can have "cat" as 1 token, or you can have "c" "a" "t" as 3 tokens.

In either case, tokens are a necessary part of LLMs. They need a differentiable representation in order to be trainable effectively. High-dimensional embeddings are differentiable and can usefully represent the "meaning" of a token.

In other words, the representation of "cat" in an LLM must be something that can be gradually nudged towards "kitten", or "print", or "excavator", or other possible meanings. This is doable with the large vector representation, but such an operation makes no sense when you try to represent the meaning directly in ASCII.
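A minimal sketch of that differentiable lookup (PyTorch assumed; the vocabulary size, embedding width, and the id for "cat" are all illustrative):

    import torch
    import torch.nn as nn

    vocab_size, dim = 50_000, 4096          # illustrative sizes
    embed = nn.Embedding(vocab_size, dim)   # one learned vector per token

    cat_id = torch.tensor([42])             # hypothetical token id for "cat"
    vec = embed(cat_id)                     # differentiable lookup

    # Any loss defined on `vec` produces gradients that shift the "cat"
    # vector continuously, e.g. toward a "kitten"-like direction.
    target = torch.randn_like(vec)          # stand-in for a learning signal
    loss = ((vec - target) ** 2).mean()
    loss.backward()
    print(embed.weight.grad[42].norm())     # nonzero: the "cat" row gets nudged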

amelius 5 days ago | parent [-]

True, but imagine an input that is ASCII, followed by a few NN layers that produce an embedded representation, and from there the usual NN layers of your LLM. The first layers can have shared weights (shared across input positions). Thus, let the LLM solve the embedding problem implicitly. Why wouldn't this work? It would be much more elegant, because the entire design would consist of neural networks, with no extra code or data treatment necessary.
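Roughly what that idea looks like as code, assuming PyTorch (the layer sizes and the ByteFrontEnd name are made up for illustration):

    import torch
    import torch.nn as nn

    class ByteFrontEnd(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.byte_embed = nn.Embedding(256, dim)   # one entry per byte value
            # shared layers, applied at every input position
            self.mix = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, text: str) -> torch.Tensor:
            ids = torch.tensor(list(text.encode("ascii")))
            return self.mix(self.byte_embed(ids))      # (len(text), dim), fed to the LLM body

    front = ByteFrontEnd()
    print(front("the cat sat").shape)                  # torch.Size([11, 512])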

mathis 4 days ago | parent | next [-]

This might be more pure, but there is nothing to be gained. On the contrary, this would lead to very long sequences for which self-attention scales poorly.
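Back-of-the-envelope version, under the standard assumption that self-attention cost grows with the square of sequence length and that English averages roughly 4 bytes per BPE token (both numbers are illustrative):

    tokens = 1_000                     # hypothetical prompt length in tokens
    raw_bytes = 4 * tokens             # the same text as individual bytes
    print(raw_bytes**2 / tokens**2)    # ~16x more attention work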

pornel 4 days ago | parent | prev [-]

The tokens are basically this: the result of precomputing and caching such layers.

blutfink 5 days ago | parent | prev | next [-]

The LLM can also “say” “cat” with few bits. Note that the meaning of the word as stored in your brain takes more than 24 bits.

amelius 5 days ago | parent [-]

No, an LLM really uses __many__ more bits per token.

First, the embedding typically uses thousands of dimensions.

Then, the value along each dimension is represented with a floating-point number, which typically takes 16 bits (it can be fewer with more aggressive quantization).
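The arithmetic behind that comparison, with assumed sizes:

    ascii_bits = 3 * 8                       # "cat" as three ASCII bytes
    dims, bits_per_dim = 4096, 16            # assumed embedding width, fp16
    print(ascii_bits, dims * bits_per_dim)   # 24 vs 65536 (~65 kilobits)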

blutfink 5 days ago | parent [-]

Of course an LLM uses more space internally for a token. But so do humans.

My point was that you compared how the LLM represents a token internally versus how “English” transmits a word. That’s a category error.

amelius 5 days ago | parent [-]

But we can feed humans ASCII, whereas LLMs require token inputs. My original question was about that: why can't we just feed the LLMs ASCII, and let it figure out how it wants to encode that internally, __implicitly__? I.e., we just design a network and feed it ASCII, as opposed to figuring out an encoding in a separate step and feeding it tokens in that encoding.

cesarb 5 days ago | parent [-]

> But we can feed humans ASCII, whereas LLMs require token inputs.

To be pedantic, we can't feed humans ASCII directly; we have to convert it to images or sounds first.

> My original question was about that: why can't we just feed the LLMs ASCII, and let it figure out how it wants to encode that internally, __implicitly__? I.e., we just design a network and feed it ASCII, as opposed to figuring out an encoding in a separate step and feeding it tokens in that encoding.

That could be done by having only 256 tokens, one for each possible byte, plus perhaps a few special-use tokens like "end of sequence". But it would be much less efficient.
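A minimal sketch of that byte-level scheme in Python (the special-token ids are made up):

    SPECIALS = {"<eos>": 256, "<pad>": 257}   # a few illustrative extras

    def encode(text: str) -> list[int]:
        return list(text.encode("utf-8")) + [SPECIALS["<eos>"]]

    def decode(ids: list[int]) -> str:
        return bytes(i for i in ids if i < 256).decode("utf-8")

    print(encode("cat"))           # [99, 97, 116, 256]
    print(decode(encode("cat")))   # cat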

amelius 5 days ago | parent [-]

Why would it be less efficient, if the LLM would convert it to an embedding internally?

cesarb 4 days ago | parent [-]

Because each byte would be an embedding, instead of several bytes (a full word or part of a word) being a single embedding. The amount of time an LLM takes is proportional to the number of embeddings (or tokens, since each token is represented by an embedding) in the input, and the amount of memory used by the internal state of the LLM is also proportional to the number of embeddings in the context window (how far it looks back in the input).
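Rough illustration of the memory side, with assumed model dimensions (layer count, heads, head size, and fp16 values are all illustrative): each position in the context keeps a fixed-size slice of internal state, so roughly 4x more positions means roughly 4x the memory.

    layers, heads, head_dim, bytes_per_val = 32, 32, 128, 2       # illustrative
    per_position = 2 * layers * heads * head_dim * bytes_per_val  # keys + values

    tokens, raw_bytes = 1_000, 4_000    # same text at two granularities
    print(tokens * per_position / 1e6, "MB vs",
          raw_bytes * per_position / 1e6, "MB")   # ~524 MB vs ~2097 MB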

fragmede 4 days ago | parent | prev [-]

The token for "cat" is 464, which is just 9 bits.
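Quick check of the bit count (the id itself depends on which tokenizer you use):

    print((464).bit_length())   # 9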