quuxplusone 4 days ago

"500 code samples generated by Magistral-24B" — So you didn't use real code?

The paper is totally mum on how "descriptive" names (e.g. process_user_input) differ from "snake_case" names (e.g. process_user_input).

The actual question here is not about the model but merely about the tokenizer: is it the case that e.g. process_user_input encodes into 5 tokens, ProcessUserInput into 3, and calcpay into 1? If you don't break down the problem into simple objective questions like this, you'll never produce anything worth reading.

ijk 4 days ago | parent [-]

True - though in the actual case of your examples, calcpay, process_user_input, and ProcessUserInput all encode into exactly 3 tokens with GPT-4.

Which is the exact kind of information that you want to know.

It is very non-obvious which one will use more tokens; the Gemma tokenizer has the highest variance, with process|_|user|_|input = 5 tokens and Process|UserInput = 2 tokens.

In practice, I'd expect the performance difference to be relatively minimal, as input tokens tend to quickly get aggregated into more general concepts. But that's exactly the kind of question worth getting metrics on: my intuition suggests one answer, but do the numbers hold up when you actually measure?

quuxplusone 3 days ago | parent [-]

Awesome! You should have written this blog post instead of that guy. :)