magicalhippo 2 days ago

Based on their tokenizer tool[1], for GPT 5.x "geschniegelt" is tokenized into three tokens:

  (ges)(chn)(iegelt)
[1]: https://platform.openai.com/tokenizer
Tiberium 2 days ago | parent

It's a single token in its most common form, that is, with a space in front of it:

"This word is geschniegelt" is [2500, 2195, 382, 192786]

The last token here is " geschniegelt".

nialv7 2 days ago | parent

Maybe this is why? Most of the training data contains the single-token version, so the three-token version was undertrained?