| ▲ | nextaccountic a day ago | |
Isn't giving this word its own token deeply wasteful, when some more common things take multiple tokens? And how do they deal with Chinese? Are some ideograms multiple tokens? | ||
| ▲ | mudkipdev a day ago | parent [-] | |
It probably just means the tokenizer's training corpus included a massive amount of German literature, or accidentally oversampled a web page where that word was repeated many times. Look up "glitch tokens" to learn more. | ||
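To make the oversampling point concrete, here is a minimal sketch of a toy byte-pair-encoding (BPE) trainer. It assumes the tokenizer in question is BPE-like, which the thread does not confirm. The corpus, the counts, and the stand-in word `xyzzyplugh` are made up for illustration, and `train_bpe` is a hypothetical helper, not any real library's API.

```python
from collections import Counter


def train_bpe(word_counts, num_merges):
    """Toy BPE trainer (illustrative only).

    word_counts maps each word to how often it occurs in the corpus.
    Returns the list of learned merges and the final segmentation of each word.
    """
    # Start with every word split into single characters.
    segs = {w: tuple(w) for w in word_counts}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for w, syms in segs.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += word_counts[w]
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w, syms in list(segs.items()):
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            segs[w] = tuple(out)
    return merges, segs


# Hypothetical corpus: one junk word repeated 1000 times (say, from a single
# oversampled page) next to much rarer ordinary words.
corpus = {"xyzzyplugh": 1000, "the": 50, "cat": 30}
merges, segs = train_bpe(corpus, num_merges=9)

print(segs["xyzzyplugh"])  # ('xyzzyplugh',)  one token for the junk word
print(segs["the"])         # ('t', 'h', 'e')  common word still split up
```

With a merge budget of only nine, every merge gets spent on the repeated word because its character pairs outnumber everything else, so it ends up as a single token while "the" stays in pieces. Real tokenizers have far larger budgets and corpora, but the same frequency-driven logic is one plausible way an odd word ends up with its own token.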