ashirviskas 6 hours ago

I wonder what would happen if we just crammed more into the "tokens". I am running an experiment that replaces discrete tokens with embeddings plus a small byte encoder/decoder. That way you can use the embedding space much more efficiently and have it contain much more nuance.
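
Roughly, the shape I mean looks like this (a minimal PyTorch-style sketch, not the actual experiment; the module names, chunk size and widths are all placeholders):

    import torch
    import torch.nn as nn

    CHUNK = 8        # bytes folded into one "token" position (fixed here, could vary)
    D_MODEL = 512    # backbone width

    class ByteEncoder(nn.Module):
        """Compress CHUNK raw bytes into one continuous embedding."""
        def __init__(self):
            super().__init__()
            self.byte_emb = nn.Embedding(256, 64)
            self.proj = nn.Linear(CHUNK * 64, D_MODEL)

        def forward(self, byte_ids):                 # (batch, seq, CHUNK) ints in [0, 255]
            x = self.byte_emb(byte_ids)              # (batch, seq, CHUNK, 64)
            return self.proj(x.flatten(-2))          # (batch, seq, D_MODEL)

    class ByteDecoder(nn.Module):
        """Predict the CHUNK bytes back from one backbone output embedding."""
        def __init__(self):
            super().__init__()
            self.head = nn.Linear(D_MODEL, CHUNK * 256)

        def forward(self, h):                        # (batch, seq, D_MODEL)
            return self.head(h).view(*h.shape[:-1], CHUNK, 256)  # per-byte logits

    # The backbone in between stays an ordinary transformer; only the token
    # embedding table and the vocab softmax head get replaced.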

Experiments I want to build on top of it:

1. Adding LSP context to the embeddings, so the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used (rough sketch after this list).

2. Experiments with different "compression" ratios. Each embedding could encode a different number of bytes, so we would not rely on a huge static token dictionary.
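
For idea 1, the model-side change could be as small as adding a projected LSP feature vector to each chunk embedding (again just a sketch; the feature count and widths are made up):

    import torch.nn as nn

    D_MODEL = 512        # same width as the byte-encoder sketch above
    N_LSP_FEATURES = 16  # e.g. symbol kind, is-definition, reference count (illustrative)

    class LspAugmentedEncoder(nn.Module):
        """Wrap a byte encoder and mix in per-chunk LSP-derived features."""
        def __init__(self, byte_encoder):
            super().__init__()
            self.byte_encoder = byte_encoder
            self.lsp_proj = nn.Linear(N_LSP_FEATURES, D_MODEL)

        def forward(self, byte_ids, lsp_feats):
            # lsp_feats: (batch, seq, N_LSP_FEATURES), precomputed by asking a
            # language server about each byte span before training/inference
            return self.byte_encoder(byte_ids) + self.lsp_proj(lsp_feats)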

I'm aware that papers exist that explore these ideas, but so far no popular/good open source models employ this. Unless someone can prove me wrong.

Yemoshino 4 hours ago | parent | next [-]

I found a few papers in this direction with Perplexity, like this one: https://ceur-ws.org/Vol-4005/paper1.pdf, and it doesn't seem to be that relevant for now.

The progress of the handful of big models seems to be so much better (because of limited compute we only have a handful of big ones, I presume) that these fine-tunings are just not relevant yet.

I'm also curious what an English + Java + HTML + CSS + JavaScript-only model would look like in size and speed, for example.

Unfortunately, whenever I ask myself the question of fine-tuning tokens (just a few days ago this question came up again), the deep dive takes too much time.

Claude only got LSP support in November, I think, and it's not even clear to me to what extent. So despite the feeling that we are moving fast, tons of basic ideas haven't even made it in yet.

stephantul 2 hours ago | parent | prev | next [-]

There are many examples of noisily encoding a large embedding vocabulary. This sounds a bit like T-free or H-net? Or BLT?

One of the main issues with this line of work is that you end up trading embedding parameters for active parameters. That is rarely a good trade-off in terms of compute.
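
Back of the envelope (numbers purely illustrative): a classic vocabulary embedding table is huge but essentially free at inference, while a byte encoder is smaller yet every parameter is active on every chunk:

    # Illustrative parameter/compute comparison; all numbers are made up.
    d_model = 4096                                       # backbone width
    vocab = 128_000
    table_params = vocab * d_model                       # ~524M, but just a lookup

    enc_width, enc_layers = 1024, 4                       # a "small" byte encoder
    encoder_params = enc_layers * 12 * enc_width ** 2     # rough transformer-layer count, ~50M

    print(f"embedding table: {table_params/1e6:.0f}M params, ~0 FLOPs per position")
    print(f"byte encoder:    {encoder_params/1e6:.0f}M params, active on every position")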

appplication 4 hours ago | parent | prev [-]

Not an expert in the space, but I'm not sure you need to modify tokens to get the model to see syntax; you basically get that exact association from attention.

ashirviskas 2 hours ago | parent [-]

You only get the association that is relevant to your project if you can cram in the whole codebase. Otherwise the model is making rough estimates, and some of the time that seems to be where it fails.

It can only be fully resolved with either infinite context length, or by doing it similarly to how humans do it: adding some LSP "color" to the code tokens.
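
Concretely, the "color" could just be a small feature record per code span, filled in by querying a language server (the lsp_client object and its methods below are placeholders, not a real API):

    from dataclasses import dataclass

    @dataclass
    class SpanColor:
        symbol_kind: int      # LSP SymbolKind code: function, class, variable, ...
        is_definition: bool   # does this span define the symbol?
        n_references: int     # how many places in the project use it

    def color_span(lsp_client, path, line, col):
        """Ask a language server about one code span and pack the answer into
        a feature record the encoder can consume (placeholder client calls)."""
        sym = lsp_client.symbol_at(path, line, col)     # hypothetical method
        refs = lsp_client.references(path, line, col)   # hypothetical method
        return SpanColor(sym.kind, sym.is_definition, len(refs))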

You can get a feel for what LLMs deal with by opening 3000 lines of code in a plain text editor and trying to do something. That may work for simple fixes, but not for whole-codebase refactors. Only ultra-skilled humans can be productive in it (using my subjective definition of "productive").