philsnow | 2 hours ago
This reminds me of the clipped "telegraphic" writing style used in telegrams, and your post further reminded me of the "standard" books of telegram abbreviations. Take a look at [0]; could we train models to use this kind of code and then decode it in the browser? These are "rich" tokens (they succinctly carry a lot of information). [0] https://books.google.com/books?id=VO4OAAAAYAAJ&pg=PA464#v=on...
derefr | 43 minutes ago | parent
I would point out that the default BPE tokenization vocabulary used by many models (cl100k_base) is already a pretty powerful shorthand. It has a lot of short tokens, sure. But then:

- Token ID 73700 is the literal entire (space-prefixed) word " strawberry". (Which neatly explains the "strawberry problem.")
- Token ID 27128 is " cryptocurrency". (And 41698 is " disappointment".)
- Token ID 44078 is " UnsupportedOperationException"!
- Token ID 58040 is 128 spaces in a row (and is the longest token in the vocabulary).

You'd be surprised how well this vocabulary can compress English prose, especially prose interspersed with code!
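To make the compression effect concrete, here's a toy sketch. The vocabulary and `encode` function below are invented for illustration (real tokenization uses learned BPE merges, not greedy longest-match, and the single-character fallbacks are omitted); the token IDs just mirror the ones quoted above. To inspect the actual cl100k_base vocabulary you'd use OpenAI's `tiktoken` library (`tiktoken.get_encoding("cl100k_base")`).

```python
# Toy illustration of why a vocabulary with long, whole-word tokens
# compresses prose well. NOT the real cl100k_base algorithm -- just
# greedy longest-match over a hypothetical mini-vocabulary.

# Hypothetical vocabulary: token string -> token ID.
# IDs mirror the cl100k_base IDs quoted in the comment above.
VOCAB = {
    " strawberry": 73700,
    " cryptocurrency": 27128,
    " UnsupportedOperationException": 44078,
    " is": 1,
    " a": 2,
}

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match tokenization (a simplification of BPE)."""
    ids = []
    i = 0
    max_len = max(len(tok) for tok in vocab)
    while i < len(text):
        # Try the longest possible piece first, shrinking until a match.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

tokens = encode(" strawberry is a cryptocurrency", VOCAB)
print(tokens)       # 31 characters collapse into just 4 token IDs
print(len(tokens))
```

The point the toy makes: once " strawberry" is a single vocabulary entry, the model never sees its individual letters, which is exactly why counting the r's in "strawberry" is hard for it.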