janalsncm 5 hours ago

This is kind of just a measurement of how representative a language is in the distribution of the tokenizer training. You could have a single token equal to “public static void main”.
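
A minimal sketch of that point, assuming a toy byte-pair-encoding trainer (the function names `train_bpe`, `apply_merge`, and `encode` and the corpus are made up for illustration, and unlike most production tokenizers this one does no whitespace pre-splitting, which is what lets a merge span several words). With a Java-heavy training corpus, the whole phrase collapses into one token, while a phrase the corpus never contained does not:

```python
from collections import Counter

def apply_merge(seq, a, b):
    """Replace every adjacent (a, b) pair in seq with the merged token a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn up to num_merges merges, starting from single characters."""
    seqs = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more
            break
        merges.append((a, b))
        seqs = [apply_merge(seq, a, b) for seq in seqs]
    return merges

def encode(text, merges):
    """Tokenize text by replaying the learned merges in training order."""
    seq = list(text)
    for a, b in merges:
        seq = apply_merge(seq, a, b)
    return seq

# Hypothetical corpus: one phrase dominates the training distribution.
corpus = ["public static void main"] * 50 + ["fn main() {}"] * 2
merges = train_bpe(corpus, num_merges=30)

print(len(encode("public static void main", merges)))  # 1
print(len(encode("pub fn main", merges)))
```

The second count depends entirely on which merges the corpus happened to produce, which is the sense in which token efficiency partly measures how well a language is represented in tokenizer training.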

cryptonector an hour ago | parent | next [-]

Well, yes. Looking beyond token efficiency, I find that the more constrained the language (stronger and richer static typing), the better and faster the LLM deals with it (fewer rounds of editing and debugging, ergo fewer tokens). C is a nightmare.

moelf 4 hours ago | parent | prev | next [-]

The most efficient languages are pretty unpopular, so by this argument they are even more efficient in reality?

make3 3 hours ago | parent | prev | next [-]

If you look at the list, you'll see that you're incorrect, as C and JavaScript are not at the top.

Seeing all the C-family languages and JavaScript at the bottom like this makes me wonder whether it's simply that curly brackets take a lot of tokens.

xigoi 2 hours ago | parent [-]

I imagine that having to write

  for (int index = 0; index < size; ++index)

instead of

  for index in 0...size

eats up a lot of tokens, especially in C where you also need this construct for iterating over arrays.
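
A rough way to put numbers on the two loops above, assuming a crude regex lexer as a stand-in (the `count_lexemes` helper and its pattern are hypothetical, and lexer-level counts are only a proxy: an LLM's subword tokenizer would give different absolute numbers):

```python
import re

# Identifiers, integers, the two multi-character operators used here,
# then any other single non-space symbol.
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|\.\.\.|[^\s\w]")

def count_lexemes(source):
    """Count lexer-level pieces in a snippet (a proxy, not an LLM token count)."""
    return len(LEXEME.findall(source))

c_loop = "for (int index = 0; index < size; ++index)"
range_loop = "for index in 0...size"

print(count_lexemes(c_loop))      # 14
print(count_lexemes(range_loop))  # 6
```

Most of the difference comes from the C form spelling out the type, the initializer, the bound check, and the increment, with `index` appearing three times.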

muyuu 4 hours ago | parent | prev [-]

You could, but you wouldn't when those keywords can all change in equivalent contexts.

eru 4 hours ago | parent [-]

What do you mean?

`public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.
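
To illustrate, here is a sketch with a made-up vocabulary and greedy longest-match lookup (an assumption for brevity; real BPE tokenizers replay learned merges instead, and the `VOCAB` entries and `tokenize` function are hypothetical). Having `public` as its own token does not stop `pub` from being used wherever the longer match fails:

```python
# Hypothetical vocabulary containing both a keyword and its prefix.
VOCAB = {"public", "pub", " static", " fn", "lish"}

def tokenize(text, vocab):
    """Greedy longest-match tokenization, falling back to single characters."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens

print(tokenize("public static", VOCAB))  # ['public', ' static']
print(tokenize("pub fn", VOCAB))         # ['pub', ' fn']
print(tokenize("publish", VOCAB))        # ['pub', 'lish']
```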