jfim 10 hours ago

Counting letters is tricky for LLMs because they operate on tokens, not letters. From the perspective of an LLM, if you ask it "this is a sentence, count the letters in it", it doesn't see a stream of characters the way we do; it sees [851, 382, 261, 21872, 11, 3605, 290, 18151, 306, 480].
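
For illustration, this is roughly what that looks like with OpenAI's tiktoken library (the encoding name is just an example, and the exact IDs vary by tokenizer):

    import tiktoken

    # Tokenize the sentence the way a GPT-style model would receive it
    enc = tiktoken.get_encoding("o200k_base")
    ids = enc.encode("this is a sentence, count the letters in it")
    print(ids)  # a short list of integer token IDs, not individual characters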

tintor 7 hours ago

So what? It knows the number of letters in each token and can sum them.
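
(That sum is certainly trivial from outside the model, since the tokenizer's decode table maps every ID back to text. A rough sketch of that outside view, again assuming tiktoken; whether the model itself has access to anything like this table is the question in the replies below:)

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    ids = enc.encode("this is a sentence, count the letters in it")
    # Decode each token ID back to its text and add up the letters
    pieces = [enc.decode([t]) for t in ids]
    total = sum(sum(ch.isalpha() for ch in piece) for piece in pieces)
    print(pieces, total)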

fzzzy 6 hours ago

How does it know the letters in the token?

It doesn't.

There's literally no mapping anywhere of the letters in a token.

ACCount37 an hour ago

There is a mapping. An internal, fully learned mapping that's derived from seeing misspellings and words spelled out letter by letter. Some models make it an explicit part of the training with subword regularization, but many don't.

It's hard to access that mapping though.

A typical LLM can semi-reliably spell common words out letter by letter - but it can't immediately say how many of a given letter a single word contains.

But spelling the word out first and THEN counting the letters? That works just fine.
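
A rough sketch of why the spelled-out version is the easy case: once the letters are written separately (each one then tends to become its own token), the counting step is mechanical. The word "strawberry" here is only an illustrative example:

    from collections import Counter

    word = "strawberry"
    spelled = " ".join(word)   # "s t r a w b e r r y" - written this way, each letter is usually its own token
    counts = Counter(word)
    print(spelled)
    print(counts["r"])         # 3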

danielscrubs 4 hours ago

If it did frequency analysis on its own, I would consider it to have PhD-level intelligence, not just PhD-level knowledge (like a dictionary).