Finding Optimal Tokenizers (blog.aqnichol.com)
24 points by mcyc 15 hours ago | 1 comment
fxtentacle 22 minutes ago | parent [-]

This is an interesting approach: formulate tokenizer selection as an integer program and hand it to an explicit solver. It's probably very slow, but you only have to run it once, and it produces the mathematically optimal result.

In the past, I got good results by trying to reduce the variance in entropy between tokens. That is very easy to implement: start with each single character as its own token, then greedily merge the most numerous outlier token pairs in a loop until you reach your desired token count. https://arxiv.org/abs/2206.12693
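
For anyone who wants to see the shape of that loop, here is a rough sketch in Python. Big caveat: the comment above doesn't spell out how an "outlier" pair is picked, so this sketch just merges the most frequent adjacent pair (i.e. plain BPE-style merging) as a stand-in. The function name `greedy_merge` and its parameters are made up for illustration, and it is not the method from the linked paper.

```python
from collections import Counter


def greedy_merge(text, target_vocab_size):
    """Greedy pair merging, starting from single-character tokens.

    ASSUMPTION: "most frequent adjacent pair" stands in for the
    entropy-outlier criterion, which the comment does not define.
    """
    tokens = list(text)   # each character starts as its own token
    vocab = set(tokens)

    while len(vocab) < target_vocab_size:
        # Count every adjacent token pair in the current tokenization.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges gain nothing

        # Replace each non-overlapping occurrence of the pair, left to right.
        merged = left + right
        out = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
        vocab.add(merged)

    return tokens, vocab


if __name__ == "__main__":
    toks, vocab = greedy_merge("abababab", 3)
    print(toks, sorted(vocab))
```

To get closer to the entropy idea, you would swap the `most_common` selection for a score based on each token's information content (for example, how far `-log2 p(token)` sits from the mean), but how exactly to do that is a guess on my part.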