| ▲ | skerit 2 days ago |
| Neat. Is it a single under-trained token in GPT-5.2? Or is something else going on? |
|
| ▲ | WatchDog 2 days ago | parent | next [-] |
| Perhaps. The word does have its own token: " geschniegelt" (the word with a space in front of it) is token 192786 in the tokenizer that GPT-5 apparently uses. https://raw.githubusercontent.com/niieani/gpt-tokenizer/refs... |
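A minimal way to check this locally, assuming GPT-5 uses the same o200k-style encoding that recent OpenAI models use (not officially confirmed):

```python
# Sketch: look up what token id 192786 decodes to. Assumes tiktoken's
# "o200k_base" encoding matches the tokenizer GPT-5 actually uses.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# If the claim in this thread holds, this prints b' geschniegelt'
# (note the leading space baked into the token).
print(enc.decode_single_token_bytes(192786))
```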
| |
| ▲ | nextaccountic a day ago | parent [-] |
| Isn't giving this word its own token rather wasteful, when more common strings take multiple tokens? And how do they handle Chinese? Are some ideograms multiple tokens? |
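As a rough answer to the ideogram question, a quick check with tiktoken (same o200k-style assumption as above): common CJK characters often get one token each, while rarer ones fall back to multiple byte-level tokens.

```python
# Sketch: how many tokens individual Chinese characters take. The exact
# counts depend on the real GPT-5 tokenizer, which is assumed here to be
# tiktoken's "o200k_base".
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for ch in ["你", "好", "龘"]:  # two common characters and a rare one
    ids = enc.encode(ch)
    print(ch, ids, f"-> {len(ids)} token(s)")
```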
| ▲ | mudkipdev a day ago | parent [-] |
| It likely means the tokenizer's training corpus included a massive amount of German literature, or accidentally oversampled a web page where that word was frequently repeated. Look up "glitch tokens" to learn more. |
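One toy way to hunt for candidates, sketched below: tokenize a reference corpus and flag vocabulary ids that never occur. Real glitch-token research also inspects model internals such as embedding norms, which isn't possible for a closed model, so treat this purely as an illustration.

```python
# Toy heuristic: ids that never appear in a reference corpus are
# candidates for being under-trained. Uses tiktoken's "o200k_base" as a
# stand-in for the (unpublished) GPT-5 tokenizer.
from collections import Counter
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def unseen_token_ids(corpus: str) -> set[int]:
    seen = Counter(enc.encode(corpus))
    return {i for i in range(enc.n_vocab) if i not in seen}

# With a toy corpus nearly every id is "unseen"; in practice you would
# stream many gigabytes of text before trusting the result.
sample = "The quick brown fox jumps over the lazy dog. " * 100
print(len(unseen_token_ids(sample)), "of", enc.n_vocab, "ids never seen")
```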
|
|
|
| ▲ | magicalhippo 2 days ago | parent | prev | next [-] |
| Based on their tokenizer tool[1], for GPT-5.x "geschniegelt" is tokenized into three tokens: (ges)(chn)(iegelt)
[1]: https://platform.openai.com/tokenizer |
| |
| ▲ | Tiberium 2 days ago | parent [-] |
| It's a single token in the most common usage, that is, with a space in front of it. "This word is geschniegelt" tokenizes to [2500, 2195, 382, 192786]; the last token is " geschniegelt". |
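Easy to reproduce with tiktoken, under the same o200k assumption as above: the leading space decides whether the word stays one token.

```python
# Sketch: same word, with and without a leading space. Assumes the
# "o200k_base" encoding matches GPT-5's actual tokenizer.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

print(enc.encode("geschniegelt"))               # bare word: splits into pieces
print(enc.encode(" geschniegelt"))              # with a space: one id, per this thread
print(enc.encode("This word is geschniegelt"))  # thread reports [2500, 2195, 382, 192786]
```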
| ▲ | nialv7 2 days ago | parent [-] |
| Maybe this is why? Most of the training data contains the single-token version, so the three-token version was undertrained? |
|
|
|