ashirviskas | 6 hours ago
I wonder what would happen if we just crammed more into the "tokens". I am running an experiment that replaces discrete tokens with embeddings plus a small byte encoder/decoder. That way you can use the embedding space much more efficiently and have it carry much more nuance. Experiments I want to build on top of it (a rough sketch of the encoder/decoder idea follows below):

1. Adding LSP context to the embeddings, so the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used.

2. Trying different "compression" ratios: each embedding could encode a different number of bytes, so we would not rely on a huge static token dictionary.

I'm aware that papers exploring these ideas exist, but so far no popular/good open-source model employs this. Unless someone can prove me wrong.
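A minimal PyTorch sketch of this kind of setup, purely for illustration: the module names, patch size, and GRU mixer are assumptions (not the actual experiment code), and the patching here is fixed-size rather than adaptive.

    # Hypothetical sketch: a tiny byte encoder/decoder that maps fixed-size
    # byte patches to single embeddings and back, so a backbone LM could
    # consume continuous patch embeddings instead of discrete BPE tokens.
    import torch
    import torch.nn as nn

    PATCH = 8      # bytes compressed into one embedding (assumed fixed here)
    D_MODEL = 512  # embedding width seen by the backbone LM

    class ByteEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.byte_emb = nn.Embedding(256, 64)             # one vector per raw byte value
            self.mix = nn.GRU(64, D_MODEL, batch_first=True)  # small sequence mixer over the patch

        def forward(self, byte_ids):                          # (batch, n_patches, PATCH) byte ids
            b, n, p = byte_ids.shape
            x = self.byte_emb(byte_ids.view(b * n, p))        # (b*n, PATCH, 64)
            _, h = self.mix(x)                                # final hidden state summarizes the patch
            return h.squeeze(0).view(b, n, D_MODEL)           # (batch, n_patches, D_MODEL)

    class ByteDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.expand = nn.Linear(D_MODEL, PATCH * 256)     # predict a byte distribution per position

        def forward(self, patch_emb):                         # (batch, n_patches, D_MODEL)
            b, n, _ = patch_emb.shape
            return self.expand(patch_emb).view(b, n, PATCH, 256)  # logits over 256 byte values

    if __name__ == "__main__":
        text = b"def used_here(): return used_here"
        padded = text + bytes(-len(text) % PATCH)             # pad to a whole number of patches
        ids = torch.tensor(list(padded)).view(1, -1, PATCH)
        enc, dec = ByteEncoder(), ByteDecoder()
        patch_embs = enc(ids)                                 # what the backbone LM would consume
        byte_logits = dec(patch_embs)                         # reconstruction target for training
        print(patch_embs.shape, byte_logits.shape)

Extra signals such as LSP-derived context could, in principle, be concatenated onto each patch embedding before it enters the backbone; that part is not shown here.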
Yemoshino | 4 hours ago
I found a few papers in this direction with Perplexity, like this one: https://ceur-ws.org/Vol-4005/paper1.pdf, and it doesn't seem that relevant for now. The progress of the handful of big models seems to be so much better (because of limited compute we only have a handful of big ones, I presume) that these fine-tunings are just not relevant yet. I'm also curious what an English-language, Java + HTML + CSS + JavaScript-only model would look like in size and speed, for example. Unfortunately, whenever I ask myself the question of fine-tuning tokens (just a few days ago this question came up again), deep diving takes too much time. Claude only got LSP support in November, I think, and it's not even clear to me to what extent. So despite the feeling that we are moving fast, tons of basic ideas haven't even made it in yet.
stephantul | 2 hours ago
There are many examples of noisily encoding a large embedding vocabulary. This sounds a bit like T-Free or H-Net? Or BLT? One of the main issues with this line of work is that you end up trading embedding parameters for active parameters, which is rarely a good trade-off in terms of compute.
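Rough back-of-the-envelope numbers for that trade-off (all sizes assumed for illustration): a BPE embedding table is huge in parameters but is just a lookup at inference time, while a per-patch byte encoder is smaller in parameters but runs on every patch.

    # Illustrative arithmetic only; the vocab size, model width, and
    # encoder size are assumptions, not measurements of any real model.
    D_MODEL = 4096
    VOCAB = 128_000
    ENCODER_PARAMS = 50_000_000     # assumed size of a per-patch byte encoder

    embedding_params = VOCAB * D_MODEL            # ~0.5B params, but just a table lookup
    lookup_flops_per_token = 0                    # effectively free: index into the table
    encoder_flops_per_patch = 2 * ENCODER_PARAMS  # rough rule of thumb: ~2 FLOPs per param

    print(f"BPE table: {embedding_params/1e9:.2f}B params, ~0 FLOPs/token")
    print(f"Byte encoder: {ENCODER_PARAMS/1e6:.0f}M params, "
          f"~{encoder_flops_per_patch/1e6:.0f}M FLOPs per patch")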
appplication | 4 hours ago
Not an expert in the space, but I'm not sure you need to modify tokens to get the model to see syntax; you basically get that exact association from attention.