| ▲ | kingstnap 7 hours ago | |
It doesn't need to be. The same text can be tokenized in many different ways even when the token set is the same. For example, there is usually one token for every digit string from "0" to "999" (with ones like "001" as separate tokens). This means there are lots of ways you can choose to tokenize a number like 27693921. The best way to deal with numbers tends to be somewhat context dependent, but for numerics, splitting into groups of 3 from right to left tends to work pretty well. They could simply have spotted that some particular patterns should be decomposed differently. | ||
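To make the grouping point concrete, here is a minimal sketch of two ways to chunk the same digit string, assuming a vocabulary that has a token for every 1 to 3 digit string. The function names are made up for illustration and are not any real tokenizer's API.

```python
def split_right_to_left(digits: str, group: int = 3) -> list[str]:
    """Chunk a digit string into groups of `group`, aligned to the right end.

    Hypothetical helper: assumes every 1..group digit string is a single token.
    """
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    # Any leftover digits go in a short leading chunk, so the remaining
    # chunks line up with thousands, millions, and so on.
    head = len(digits) % group
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + group] for i in range(head, len(digits), group)]
    return chunks


def split_left_to_right(digits: str, group: int = 3) -> list[str]:
    """Chunk greedily from the left; the short chunk ends up at the end."""
    return [digits[i:i + group] for i in range(0, len(digits), group)]


print(split_right_to_left("27693921"))  # ['27', '693', '921']
print(split_left_to_right("27693921"))  # ['276', '939', '21']
```

Both outputs use only tokens from the same assumed vocabulary and concatenate back to the same string, but only the right-to-left split keeps chunk boundaries at the same place values regardless of how long the number is.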