comeonbro 5 months ago

Imagine if I asked you how many '⊚'s are in 'Ⰹ⧏⏃'? (the answer is 3, because there is 1 ⊚ in Ⰹ and 2 ⊚s in ⏃)

Much harder question than if I asked you how many '⟕'s are in 'Ⓕ⟕⥒⟲⾵⟕⟕⢼' (the answer is 3, because there are 3 ⟕s there)

You'd need to read through something like 100,000x more random internet text to infer that there is 1 ⊚ in Ⰹ and 2 ⊚s in ⏃ (when this is not something that people ever explicitly talk about) than you would need to figure out that there are 3 ⟕s when 3 ⟕s appear, or to figure out from context clues that Ⰹ⧏⏃s are red and edible.

The former is how tokenization makes 'strawberry' look to LLMs: https://i.imgur.com/IggjwEK.png

It's a consequence of an engineering tradeoff, not a demonstration of a fundamental limitation.
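The effect described above can be sketched concretely. This is a toy illustration only (the vocabulary pieces and token IDs below are made up, not the real GPT tokenizer): a greedy longest-match tokenizer turns 'strawberry' into three opaque token IDs, and those IDs are all the model ever sees.

```python
# Toy sketch (hypothetical merges and IDs, not a real tokenizer):
# how a BPE-style vocabulary turns 'strawberry' into opaque token IDs,
# hiding its individual letters from the model.
toy_vocab = {"str": 496, "aw": 675, "berry": 15717}

def toy_tokenize(word, vocab):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

print(toy_tokenize("strawberry", toy_vocab))  # [496, 675, 15717]
```

From the model's side, counting the 'r's in [496, 675, 15717] requires having learned, from training text alone, that token 496 contains one 'r', 675 none, and 15717 two.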

hansmayer 5 months ago | parent [-]

I get the technical challenge. It's just that a system that has to be trained on petabytes of data, just to (sometimes) correctly solve a problem that a six- or seven-year-old kid can solve after learning to spell, may not be the right solution to the problem at hand. Haven't the MBAs been shoving it down our throats that all cost-ineffective solutions have to go? Why are we burning hundreds of billions of dollars on developing tools whose most common use case (or better said: plea by the VC investors) is a) summarising emails (I am not an idiot who cannot read) and b) writing emails (really, I know how to write too, and can do it better)? The only use case where they are sometimes useful is taking out the boring parts of software development, because of the relatively closed learning context, and as someone who has used them for over a year for this: they are not reliable and have to be double-checked, lest you introduce more issues into your codebase.

comeonbro 4 months ago | parent [-]

It's not a technical challenge in this case, it's a technical tradeoff. You could train an LLM with single characters as the atomic unit and it would be able to count the 'r's in 'strawberry' no problem. The tradeoff is that processing the word 'strawberry' would then be 10 sequential steps: 10 complete runs through the entire LLM, where each one has to finish before the next can start.

Instead, they're almost always trained with (what we see as, but they literally do not) multi-character tokens as the atomic unit, so 'strawberry' is spelled 'Ⰹ⧏⏃'. Processing that is only 3 sequential steps, only 3 complete runs through the entire LLM. But the model needs to encounter enough relevant text in training to figure out that 'Ⰹ' somehow has 1 'r' in it, '⧏' has 0 'r's, and '⏃' has 2 'r's (which very little text actually demonstrates) to be able to count the 'r's in 'Ⰹ⧏⏃' correctly.

So the tradeoff is: everything being 3-5x slower and more expensive (but you can count the 'r's in 'strawberry'), versus being bad at basically nothing except character-level tasks like counting letters in words.
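To make the tradeoff concrete, here's a small sketch. The token split 'str'/'aw'/'berry' and the per-token letter counts are assumptions for illustration, not real tokenizer output: character-level atoms make letter-counting trivial but multiply the sequential steps, while token-level atoms keep the step count low at the cost of forcing the model to have learned each token's letter counts from training data.

```python
word = "strawberry"

char_units = list(word)               # character-level atoms
token_units = ["str", "aw", "berry"]  # hypothetical token-level atoms

# Number of atomic units = number of sequential runs through the model
# (as described above): 10 at character level vs 3 at token level.
print(len(char_units), len(token_units))  # 10 3

# At character level, counting letters is direct:
print(sum(c == "r" for c in char_units))  # 3

# At token level, the model would need to have absorbed per-token
# letter counts from training text (assumed learned facts here):
learned_r_counts = {"str": 1, "aw": 0, "berry": 2}
print(sum(learned_r_counts[t] for t in token_units))  # 3
```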

Easy choice, but it leads to this stupid misunderstanding being absolutely everywhere, and by itself it does an enormous amount of damage to people's ability to understand what is happening and about to happen.

hansmayer 4 months ago | parent [-]

Right, so... they are still not able to spell out single letters because the algorithm we use to train them to do so is far too expensive? Wake me up when it "happens" (and it gets out of its current, three-year-long 'about to happen' phase), e.g. when it stops costing 200B USD to do character-level tokenisation of a string, a problem we first solved some 50-60 years ago with higher-level programming languages. Funnily enough, those algorithms can run on an 8-bit computer in negligible time and require nowhere near the resources these Frankensteins need in order to sometimes get the count of Rs in strawberries right. Provided we train them with petabytes of data and provide gigawatts of power.

CamperBob2 4 months ago | parent [-]

It's happened, you can wake up now.

But you'll just move the goalposts again, I imagine.

hansmayer 4 months ago | parent [-]

What goalposts? I am sorry, but as someone who has been using GitHub Copilot for quite some time now, I can tell you that unfortunately, no, it has not happened, and the evidence is there with every single prompt.