derefr 2 hours ago

I would point out that the default BPE tokenization vocabulary used by many models (cl100k_base) is already a pretty powerful shorthand. It has a lot of short tokens, sure. But then:

Token ID 73700 is literally the entire (space-prefixed) word " strawberry". (Which neatly explains the "strawberry problem": the model sees one opaque token, not the individual letters it's being asked to count.)

Token ID 27128 is " cryptocurrency". (And 41698 is " disappointment".)

Token ID 44078 is " UnsupportedOperationException"!

Token ID 58040 is 128 spaces in a row (and is the longest token in the vocabulary).
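If you want to check the IDs listed above yourself, here's a rough sketch. I'm assuming you have the `tiktoken` package installed; `describe_tokens` and `show_cl100k_tokens` are just names I made up for illustration. The helper takes any "ID to bytes" function, so it isn't tied to one tokenizer.

```python
from typing import Callable, Dict, Iterable


def describe_tokens(
    ids: Iterable[int], decode_bytes: Callable[[int], bytes]
) -> Dict[int, str]:
    """Map each token ID to its decoded text.

    `decode_bytes` is any callable returning the raw bytes for one token ID.
    Bytes that are not valid UTF-8 on their own are replaced, not raised on.
    """
    return {i: decode_bytes(i).decode("utf-8", errors="replace") for i in ids}


def show_cl100k_tokens() -> None:
    """Print the tokens named in this comment (assumes `pip install tiktoken`)."""
    import tiktoken  # imported lazily so the helper above has no dependencies

    enc = tiktoken.get_encoding("cl100k_base")
    ids = [73700, 27128, 41698, 44078, 58040]
    decoded = describe_tokens(ids, enc.decode_single_token_bytes)
    for token_id, text in decoded.items():
        # repr() makes the leading space (or the run of spaces) visible
        print(token_id, len(text), repr(text))
```

Calling `show_cl100k_tokens()` prints each ID with its length and its text, so you can see for yourself what each one decodes to.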

You'd be surprised how well this vocabulary can compress English prose, especially prose interspersed with code!
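One way to put a number on "how well" is average bytes per token. This is only a sketch: `bytes_per_token` and `measure_with_cl100k` are hypothetical helper names, and the second one assumes `tiktoken` is installed. I'm deliberately not quoting a ratio here, since it depends heavily on the text you feed in.

```python
from typing import Callable, Sequence


def bytes_per_token(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of UTF-8 bytes covered by each token of `text`."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("text produced no tokens")
    return len(text.encode("utf-8")) / len(tokens)


def measure_with_cl100k(text: str) -> float:
    """Bytes per token under cl100k_base (assumes `pip install tiktoken`)."""
    import tiktoken  # lazy import: bytes_per_token works without it

    enc = tiktoken.get_encoding("cl100k_base")
    # Note: enc.encode raises by default if `text` contains special-token
    # strings such as "<|endoftext|>".
    return bytes_per_token(text, enc.encode)
```

Try `measure_with_cl100k` on a paragraph of plain prose and then on a snippet of Java with long exception names, and compare the two numbers. A byte-level tokenizer would score exactly 1.0 by this measure, which gives you a baseline.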