This is mostly just a measurement of how well represented a language is in the tokenizer's training distribution. You could have a single token equal to “public static void main”.
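To make that concrete, here is a toy sketch of byte-pair-encoding training, assuming a character-level start and no whitespace pre-splitting (production tokenizers usually start from bytes and pre-split, which would normally keep a phrase with spaces from becoming one token). The function name `train_bpe` and the tiny corpus are made up for illustration. It only shows that whatever dominates the training data gets merged into long tokens, while rare text stays fragmented.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    # Assumption: start from single characters and skip pre-tokenization.
    sequences = [list(line) for line in corpus]
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so no merge is worth learning
        merged = a + b
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return sequences

# A corpus dominated by Java boilerplate, plus one line of something rarer.
corpus = ["public static void main"] * 50 + ["fn main"]
sequences = train_bpe(corpus, num_merges=100)
print(sequences[0])   # the over-represented phrase, after merging
print(sequences[-1])  # the rare line, still split into several pieces
```

With this skewed corpus the Java phrase collapses into one symbol, while the rare line only benefits from merges it happens to share with the common one.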

▲ cryptonector an hour ago | parent | next [-]

Well, yes. Looking beyond token efficiency, I find that the more constrained the language (stronger and richer static typing), the better and faster the LLM deals with it: fewer rounds of editing and debugging, ergo fewer tokens. C is a nightmare.

▲ moelf 4 hours ago | parent | prev | next [-]

The most efficient languages are pretty unpopular, so by this argument they'd be even more efficient in reality?

▲ make3 3 hours ago | parent | prev | next [-]

If you look at the list, you'll see that you're incorrect, as C and JavaScript are not at the top.

Seeing all the C-family languages and JavaScript at the bottom like this makes me wonder if it's not just that curly brackets take a lot of tokens.

	▲ xigoi 2 hours ago | parent [-]
		I imagine that having to write `for (int index = 0; index < size; ++index)` instead of `for index in 0...size` eats up a lot of tokens, especially in C where you also need this construct for iterating over arrays.
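A rough way to eyeball the difference between those two loop headers, assuming a crude lexical split as a stand-in for a real tokenizer. An LLM's BPE vocabulary will give different absolute numbers; `rough_token_count` is a hypothetical helper, not anything from the linked benchmark.

```python
import re

def rough_token_count(code):
    """Count identifiers, numbers, and single punctuation characters.

    This is only a proxy: a real BPE tokenizer may merge or split
    these pieces differently.
    """
    return len(re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", code))

c_style = "for (int index = 0; index < size; ++index)"
range_style = "for index in 0...size"

print(rough_token_count(c_style), rough_token_count(range_style))
```

Even by this crude count, the C-style header comes out at roughly twice the size of the range-style one, mostly because the loop variable is spelled three times.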

▲ muyuu 4 hours ago | parent | prev [-]

You could, but you wouldn't when each of those keywords can change independently in equivalent contexts.

	▲ eru 4 hours ago | parent [-]
		What do you mean? `public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.
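A small sketch of that point, assuming a greedy longest-match lookup over a made-up vocabulary. Real BPE encoders apply learned merges in order rather than doing longest-match, so treat this as a simplification. A vocabulary can hold both `public` and `pub`, and each is used where it fits.

```python
def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocabulary entry that matches.

    Falls back to a single character when nothing in the vocabulary matches.
    """
    tokens, i = [], 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

# Hypothetical vocabulary containing both the long and the short keyword.
vocab = {"public", "pub", " static", " fn"}

print(greedy_tokenize("public static", vocab))
print(greedy_tokenize("pub fn", vocab))
```

Having a dedicated `public` token costs nothing for text that only contains `pub`; the shorter entry is still there for it.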