| ▲ | derefr 3 days ago | |
FYI, the LLM letter-counting problem has nothing to do with counting per se. It comes down entirely to LLMs not getting to see your raw UTF-8 byte stream: a tokenizer sits between you and the model, chunking your UTF-8 bytes into arbitrary, entirely-opaque-to-the-LLM token groupings. Try it for yourself: under the most popular tokenizer vocabulary (https://tiktokenizer.vercel.app/?model=cl100k_base), "strawberry" becomes [str][aw][berry]. Or, from the model's perspective, [496, 675, 15717]. The model doesn't know anything more about how those numbers correspond to letters than you do! It never gets sat down and told "[15717] <=> [b][e][r][r][y]", with single-byte tokens on the right. (In fact, these single-byte tokens appear in the training data extremely rarely, and so the model doesn't often learn to do anything with them.) Note that LLMs can predictably count the number of r's in "s t r a w b e r r y", because <Count the number of r's in "s t r a w b e r r y"> becomes [Count][ the][ number][ of][ r]['s][ in][ "][s][ t][ r][ a][ w][ b][ e][ r][ r][ y]["]. And that's just a matching problem: [ r] tokens for [ r] tokens, no token-correspondence-mapping needed. | ||
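To make the argument in the comment above concrete, here is a minimal Python sketch. It is not the real cl100k_base tokenizer: `TOY_VOCAB` and `toy_tokenize` are hypothetical names, the matching is plain greedy longest-match rather than BPE, and only the three ids for [str][aw][berry] come from the comment. Every single-letter id is a made-up placeholder.

```python
# Toy vocabulary. The three multi-letter ids are the ones quoted in the
# comment for cl100k_base; all other ids are invented for illustration.
TOY_VOCAB = {"str": 496, "aw": 675, "berry": 15717}
for offset, ch in enumerate("abcdefghijklmnopqrstuvwxyz"):
    TOY_VOCAB[ch] = 1000 + offset        # hypothetical single-letter ids
    TOY_VOCAB[" " + ch] = 2000 + offset  # hypothetical space-prefixed ids


def toy_tokenize(text):
    """Greedy longest-match split over TOY_VOCAB (a stand-in, not real BPE)."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return pieces


def ids(text):
    """What the model actually receives: a list of integers."""
    return [TOY_VOCAB[piece] for piece in toy_tokenize(text)]


whole = ids("strawberry")
spaced = ids("s t r a w b e r r y")

print(whole)                          # [496, 675, 15717]
print(whole.count(TOY_VOCAB["r"]))    # 0: no "r" id appears anywhere
print(spaced.count(TOY_VOCAB[" r"]))  # 3: counting is plain id matching
```

In the first case, answering "how many r's" requires knowing which letters hide behind 496 and 15717; in the second, it only requires spotting the same id three times.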
| ▲ | orbital-decay 3 days ago | |
> entirely-opaque-to-the-LLM token groupings
This is clearly not the case: any modern (non-reasoning) model easily decomposes words into individual single-character tokens (try separating the letters with e.g. Braille spaces...) and handles arbitrary tokenization variants if forced with a sampler. The problem goes way deeper than tokenization: what models struggle with is exactly counting items in a list, exact ordering, retrieving scattered data, etc. LLM context works a lot more like associative memory than a sequence that can be iterated over. There are also fundamental biases and specific model quirks that contribute to this. | ||
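For anyone who wants to try the Braille-space experiment mentioned above, a minimal sketch of building such a prompt follows. The helper names `spell_out` and `letter_count_prompt` are hypothetical, and the prompt wording is only an example. Whether a given tokenizer really splits the result into one token per letter is an assumption to check against that tokenizer, not something this snippet verifies.

```python
BRAILLE_BLANK = "\u2800"  # U+2800 BRAILLE PATTERN BLANK


def spell_out(word, sep=BRAILLE_BLANK):
    """Put a separator between every pair of adjacent characters."""
    return sep.join(word)


def letter_count_prompt(word, letter):
    """Build an example prompt asking for a letter count (wording is illustrative)."""
    return f'Count the number of {letter}\'s in "{spell_out(word)}"'


prompt = letter_count_prompt("strawberry", "r")
print(prompt)
```

Comparing a model's answers on the plain word and on the separated form is one way to see how much of the failure is down to tokenization and how much remains once the letters are individually visible.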