Remix.run Logo
thesz 4 hours ago

That "underappreciated underlying structure or geometry" can be just an artifact of the same tokenization used with different models.

Tokenization breaks up collocations and creates new ones that are not always present in the original text as it was. Most probably, the first byte pair found by simple byte pair encoding algorithm in enwik9 will be two spaces next to each other. Is this a true collocation? BPE thinks so. Humans may disagree.

What does concern me here is that it is very hard to ablate tokenization artifacts.