deklesen 9 hours ago

Mhh... my hunch is that part of this is that all Python keywords are 1 token, I assume, while for those very weird languages tokenization might make it harder to reason over the program text.

Would love to see how the benchmark results change if the esoteric languages are tweaked so they have 1-token keywords only.
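To make the hunch concrete: BPE-style tokenizers do tend to merge common keywords like `while` into single tokens, while rare symbol strings get split apart. Here's a toy greedy longest-match tokenizer with a made-up three-entry vocabulary (a stand-in for real BPE, not any actual model's tokenizer) showing the asymmetry:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a toy vocabulary.

    Stands in for BPE: substrings found in the vocab become one token,
    anything else falls back to single characters.
    """
    tokens, i = [], 0
    while i < len(text):
        # try the longest possible match first
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown: emit a single character
            i += 1
    return tokens

# hypothetical vocab where Python keywords are merged, BF symbols are not
vocab = {"while", " True", ":"}
print(greedy_tokenize("while True:", vocab))  # 3 tokens
print(greedy_tokenize("[->+<]", vocab))       # 6 single-char tokens
```

Under this (assumed) vocabulary, the Python snippet is 3 tokens while the equally short Brainfuck loop is 6, one per character, which is the flavor of overhead the comment is describing.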

chychiu 8 hours ago | parent [-]

Considering that Brainfuck only has 8 commands and models are scoring 6.2%, I don't think tokenization is the issue.

altruios 8 hours ago | parent [-]

The *only* issue, that is.

Reasoning is hard. Reasoning about colors while wearing glasses that obfuscate the real colors is even harder... but the glasses aren't the core issue if your brain isn't wired correctly to reason in the first place.

I suspect the way out of this is to separate knowledge from reasoning: train reasoning with zero knowledge and zero language, and then train language on top of the pre-trained-for-reasoning model.

onoesworkacct 5 hours ago | parent [-]

LLMs already use mixture-of-experts models; if you ensure the neurons are all glued together, then (I think) you train language and reasoning simultaneously.
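For readers unfamiliar with the term: in a mixture-of-experts layer, a small gating network routes each token to one (or a few) of several expert feed-forward networks, so different experts can specialize. A toy NumPy sketch of top-1 routing (illustrative only; real MoE layers in LLMs add load balancing, top-k routing, and run inside a transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Toy top-1 mixture-of-experts feed-forward layer."""

    def __init__(self, d_model, d_hidden, n_experts):
        self.gate = rng.normal(0, 0.02, (d_model, n_experts))
        self.w1 = rng.normal(0, 0.02, (n_experts, d_model, d_hidden))
        self.w2 = rng.normal(0, 0.02, (n_experts, d_hidden, d_model))

    def __call__(self, x):  # x: (tokens, d_model)
        scores = softmax(x @ self.gate)   # gating probabilities, (tokens, n_experts)
        expert = scores.argmax(-1)        # top-1 routing decision per token
        out = np.zeros_like(x)
        for e in range(self.w1.shape[0]):
            sel = expert == e             # tokens routed to expert e
            if sel.any():
                h = np.maximum(x[sel] @ self.w1[e], 0)        # expert FFN, ReLU
                out[sel] = (h @ self.w2[e]) * scores[sel, e:e+1]
        return out, expert

x = rng.normal(size=(5, 8))
y, expert = MoELayer(8, 16, 4)(x)  # 5 tokens routed across 4 experts
```

The "glued together" question is whether routing lets some experts carry knowledge and others carry reasoning, versus every token passing through the same dense weights.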