singularity2001 | 7 days ago
Working on variants of the Byte Latent Transformer [0] to get rid of tokenization, which hinders mathematical performance and reasoning about individual letters. In the original Byte Latent Transformer paper they reintroduce ugly caching and n-grams, which I'm looking to eliminate. As expected, pure byte-level Transformers need some rethinking to stay performant: some kind of matryoshka mechanism so that long, predictable byte sequences (words and phrases) get grouped into a single latent vector. The idea is to apply this "Byteformer" not just to text but also to compiled files, songs, etc. If it turns out to be impossible to scale this architecture, a modified tokenizer could still be helpful, one that falls back to bytes / Unicode whenever a number or an unfamiliar word is encountered (see the sketch below).
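
To make that fallback idea concrete, here is a minimal toy sketch of my own (not the BLT architecture, and the vocabulary and helper names are hypothetical): a tokenizer with a small fixed word vocabulary that drops down to raw UTF-8 bytes whenever it sees a number or a word it does not know.

    import re

    WORD_VOCAB = {"the": 0, "cat": 1, "sat": 2}   # hypothetical toy vocabulary
    BYTE_OFFSET = len(WORD_VOCAB)                  # byte tokens live after the word ids

    def encode(text: str) -> list[int]:
        tokens = []
        for piece in re.findall(r"\w+|\S", text.lower()):
            if piece.isdigit() or piece not in WORD_VOCAB:
                # number or unfamiliar word -> one token per UTF-8 byte
                tokens.extend(BYTE_OFFSET + b for b in piece.encode("utf-8"))
            else:
                tokens.append(WORD_VOCAB[piece])   # familiar word -> single token
        return tokens

    print(encode("the cat sat 42"))   # words stay whole, "42" becomes byte tokens

The point is just that numbers and rare words keep their exact byte-level structure, while frequent words still compress to one token each.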