comeonbro 9 hours ago

It's not a technical challenge in this case, it's a technical tradeoff. You could train an LLM with single characters as the atomic unit and it would be able to count the 'r's in 'strawberry' no problem. The tradeoff is that processing the word 'strawberry' would then take 10 sequential steps, 10 complete runs through the entire LLM, where each one has to finish before you can start the next.

Instead, they're almost always trained with multi-character tokens as the atomic unit (that is how we see them; the model literally does not), so 'strawberry' is spelled 'Ⰹ⧏⏃'. Processing that is only 3 sequential steps, only 3 complete runs through the entire LLM. But to count the 'r's in 'Ⰹ⧏⏃' correctly, it needs to encounter enough relevant text in training to figure out that 'Ⰹ' somehow has 1 'r' in it, '⧏' has 0 'r's, and '⏃' has 2 'r's, and really not a lot of text demonstrates that.
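To make the step counts concrete, here is a minimal Python sketch. The split 'str' / 'aw' / 'berry' is a hypothetical tokenization chosen to match the 1, 0, 2 'r' counts above; a real tokenizer may split the word differently, and the integer IDs are made up for illustration.

```python
word = "strawberry"

# Character-level units: one sequential step per character.
char_units = list(word)

# Hypothetical multi-character split (an assumption, not any real tokenizer's output).
token_units = ["str", "aw", "berry"]

# What the model actually receives: opaque IDs with no visible letters.
# These ID values are invented for illustration.
token_ids = {tok: 1000 + i for i, tok in enumerate(token_units)}
model_input = [token_ids[tok] for tok in token_units]

char_steps = len(char_units)    # sequential steps at character level
token_steps = len(token_units)  # sequential steps at token level

# The per-token letter counts the model would have to learn from training text,
# since it cannot read them off the IDs.
r_per_token = [tok.count("r") for tok in token_units]
total_r = sum(r_per_token)

print(char_steps, token_steps, model_input, r_per_token, total_r)
```

We can compute `r_per_token` trivially because we can see the letters inside each token; the model only ever sees `model_input`.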

The tradeoff is between everything being 3-5x slower and more expensive (but you can count the 'r's in 'strawberry') and, as basically the only downside, being bad at character-level tasks like counting letters in words.

Easy choice, but it leads to this stupid misunderstanding being absolutely everywhere, and that by itself does an enormous amount of damage to people's ability to understand what is happening and what is about to happen.