topce 5 hours ago
Very interesting... I had a similar idea: train an LLM in Serbian and even create a new encoding, https://github.com/topce/YUTF-8, inspired by YUSCII. I didn't have the time or money ;-) Great that you succeeded. The idea: if you train on Serbian text encoded in YUTF-8 (not UTF-8), a prompt in Serbian would use fewer tokens than in English, since Serbian Cyrillic characters are 1 byte in YUTF-8 instead of 2 in UTF-8. Serbian is a phonetic language, so we never ask how a word is spelled. It has both Latin and Cyrillic alphabets.
xodn348 5 hours ago | parent
Really interesting approach: attacking token efficiency at the encoding level is more fundamental than what I did. Even without retraining BPE from scratch, starting with YUTF-8 and measuring how existing tokenizers handle it would already be a worthwhile experiment. Hope you find the time to build it, good luck!
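To illustrate the byte-savings claim above, here's a minimal sketch comparing UTF-8 against a hypothetical single-byte YUSCII-style mapping for Serbian Cyrillic. The byte layout (`0x80` onward) and the helper `yutf8_like_encode` are assumptions for illustration only; the actual YUTF-8 table lives in the linked repo.

```python
# Sketch: Serbian Cyrillic is 2 bytes per letter in UTF-8.
# A YUSCII-inspired single-byte encoding halves that.

# The 30 lowercase Serbian Cyrillic letters, mapped to bytes 0x80-0x9D.
# (Assumed layout, not the real YUTF-8 table.)
SERBIAN_CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
ONE_BYTE_TABLE = {ch: bytes([0x80 + i]) for i, ch in enumerate(SERBIAN_CYRILLIC)}

def yutf8_like_encode(text: str) -> bytes:
    """Encode Serbian Cyrillic as one byte each; fall back to UTF-8 otherwise."""
    out = bytearray()
    for ch in text.lower():
        if ch in ONE_BYTE_TABLE:
            out += ONE_BYTE_TABLE[ch]
        else:
            out += ch.encode("utf-8")
    return bytes(out)

word = "здраво"  # "hello" in Serbian Cyrillic
print(len(word.encode("utf-8")))     # 12 bytes: 2 bytes per Cyrillic letter
print(len(yutf8_like_encode(word)))  # 6 bytes: 1 byte per letter
```

Since byte-level BPE tokenizers start from raw bytes, halving the byte count gives the merge process half as much input to compress, which is where the token savings would come from.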