amelius 5 days ago

True, but imagine an input that is raw ASCII, followed by a few neural-network layers that produce an embedded representation, and from there the usual layers of your LLM. The first layers can have shared weights (shared across all input positions). Thus, let the LLM solve the embedding problem implicitly. Why wouldn't this work? It would be much more elegant, because the entire design would consist of neural networks, with no extra code or data treatment necessary.
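For concreteness, here is a minimal sketch of that idea as I read it, written in PyTorch; the sizes and names are my own illustration, not anything established in the thread. The same small front-end is applied at every byte position, and its output would feed the usual transformer stack:

    import torch
    import torch.nn as nn

    class ByteFrontEnd(nn.Module):
        """Shared layers applied at every input position (byte values 0-255)."""
        def __init__(self, d_model=512):
            super().__init__()
            self.byte_embed = nn.Embedding(256, d_model)   # one vector per byte value
            self.shared_mlp = nn.Sequential(               # same weights at every position
                nn.Linear(d_model, d_model), nn.GELU(),
                nn.Linear(d_model, d_model),
            )

        def forward(self, byte_ids):                       # (batch, seq_len) ints in [0, 255]
            return self.shared_mlp(self.byte_embed(byte_ids))

    # hidden = ByteFrontEnd()(byte_ids)
    # logits = transformer_body(hidden)   # hypothetical downstream LLM layers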

mathis 4 days ago | parent | next [-]

This might be more pure, but there is nothing to be gained. On the contrary, it would lead to very long sequences, for which self-attention scales poorly (quadratically in sequence length).
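To put a rough number on it (assuming a ballpark of ~4 bytes per token for English text, which is my own figure, not from the thread):

    # Back-of-envelope comparison of byte-level vs. token-level sequence cost.
    bytes_per_token = 4                         # rough average for English text
    token_seq = 2048                            # tokens in a context window
    byte_seq = token_seq * bytes_per_token      # same text at the byte level

    ratio_len = byte_seq / token_seq            # ~4x longer sequence
    ratio_attn = ratio_len ** 2                 # self-attention is O(n^2): ~16x more work
    print(ratio_len, ratio_attn)                # 4.0 16.0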

pornel 4 days ago | parent | prev [-]

Tokens are basically this: the result of precomputing and caching those first layers.
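One way to read that: a token embedding table is roughly what you get if you run such a shared front-end once per vocabulary entry and cache the result. A toy illustration of that reading, reusing the ByteFrontEnd sketch above (my own construction; the mean-pooling is an arbitrary choice):

    import torch

    # Requires the ByteFrontEnd class from the earlier sketch.
    front_end = ByteFrontEnd()
    vocab = ["the", " cat", " sat"]                  # toy vocabulary

    # Run the shared front-end once per vocabulary string and cache a pooled vector.
    table = torch.stack([
        front_end(torch.tensor([[b for b in s.encode("ascii")]])).mean(dim=1).squeeze(0)
        for s in vocab
    ])                                               # (vocab_size, d_model), computed once

    token_embedding = torch.nn.Embedding.from_pretrained(table)  # cached lookup from here on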