staticshock 3 days ago

LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.

derefr 3 days ago | parent | next [-]

FYI, the LLM letter-counting problem has nothing to do with counting per se, and is instead entirely down to LLMs not getting to see your raw UTF-8 byte stream, but rather having a tokenizer intermediating between you and it, chunking your UTF-8 bytes into arbitrary, entirely-opaque-to-the-LLM token groupings.

Try it for yourself: under the most popular tokenizer vocabulary (https://tiktokenizer.vercel.app/?model=cl100k_base), "strawberry" becomes [str][aw][berry]. Or, from the model's perspective, [496, 675, 15717]. The model doesn't know anything more about how those numbers correspond to letters than you do! It never gets sat down and told "[15717] <=> [b][e][r][r][y]", with single-byte tokens on the right. (In fact, these single-byte tokens appear in the training data extremely rarely, and so the model doesn't often learn to do anything with them.)

Note that LLMs can predictably count the number of r's in "s t r a w b e r r y", because <Count the number of r's in "s t r a w b e r r y"> becomes [Count][ the][ number][ of][ r]['s][ in][ "][s][ t][ r][ a][ w][ b][ e][ r][ r][ y]["]. And that's just a matching problem — [ r] tokens for [ r] tokens, no token-correspondence-mapping needed.
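
To make the token-matching point concrete, here is a minimal sketch in Python. It is a toy, not the real tokenizer: the three IDs 496, 675 and 15717 are the ones quoted above, every single-letter ID is made up, and greedy longest-match is an assumed stand-in for real BPE merging.

```python
# Toy vocabulary. The three multi-letter IDs are the ones quoted above for
# cl100k_base; all single-letter IDs below are hypothetical.
VOCAB = {"str": 496, "aw": 675, "berry": 15717, '"': 1}
for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz"):
    VOCAB[ch] = 1000 + i        # letter without a leading space (made-up ID)
    VOCAB[" " + ch] = 2000 + i  # letter with a leading space (made-up ID)

MAX_LEN = max(len(token) for token in VOCAB)


def encode(text):
    """Greedy longest-match tokenization (a simplification of real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


whole = encode("strawberry")
spaced = encode("s t r a w b e r r y")

print(whole)                      # [496, 675, 15717]
print(whole.count(VOCAB["r"]))    # 0: no standalone r token to match
print(spaced.count(VOCAB[" r"]))  # 3: plain ID equality finds every r
```

In the spaced version, counting is just comparing IDs. In the unspaced version there is nothing to compare against unless something has learned what is inside 496 and 15717.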

orbital-decay 3 days ago | parent [-]

>entirely-opaque-to-the-LLM token groupings

This is clearly not the case: any modern (non-reasoning) model easily decomposes words into individual token-characters (try separating them with e.g. Braille spaces...) and handles arbitrary tokenization variants if forced with a sampler. It's way deeper than tokenization, and models struggle exactly with counting items in a list, exact ordering, retrieving scattered data, etc. LLM context works a lot more like associative memory than a sequence that can be iterated over. There are also fundamental biases and specific model quirks that lead to this.

8note 3 days ago | parent | prev | next [-]

> When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries".

No, it isn't. It makes sense that they can't count the r's because they don't have access to the actual word, only tokens that might represent parts or the whole of the word.

orbital-decay 3 days ago | parent [-]

Tokenization is a simplistic explanation which is likely wrong, at least in part. Models are perfectly fine reciting words character by character, using different tokenization strategies for the same word if forced to (e.g. replacing the starting space or breaking words up into basic character tokens), handling complex word formation in languages that heavily depend on it, etc. LLMs work with concepts rather than tokens.
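
For what "different tokenization strategies for the same word" means mechanically, here is a rough Python sketch that lists every way one string can be split into pieces from a vocabulary. The vocabulary is hypothetical and tiny, and the sketch only shows that one word has many valid token sequences; it says nothing about how well any model handles them.

```python
def segmentations(word, vocab):
    """Return every way to split `word` into pieces that all appear in `vocab`."""
    if not word:
        return [[]]  # exactly one way to split the empty string: no pieces
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in vocab:
            # Keep this head and recurse on whatever is left.
            for tail in segmentations(word[end:], vocab):
                results.append([head] + tail)
    return results


# Hypothetical vocabulary: every letter of the word plus a few longer pieces.
TOY_VOCAB = set("strawberry") | {"str", "aw", "berry", "straw"}

splits = segmentations("strawberry", TOY_VOCAB)
for pieces in splits:
    print(pieces)
```

Both ['str', 'aw', 'berry'] and the fully character-level split show up in the output. A standard tokenizer emits only one of these, while a forced sampler could feed the model any of them.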

im3w1l 3 days ago | parent | prev [-]

A big part of skill acquisition in humans is moving tasks from System 2 to System 1, to free up the very scarce thinking resources for ever more complex tasks that can then in turn be internalized and handled by System 1.