robot-wrangler 4 hours ago

> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.

Finally! This is a really obvious test case that I've wondered about myself, and that I've seen many casual skeptics and cautiously optimistic people independently raise for several years now. When a megacorp isn't crowing about such a test, the silence is deafening: it was practically guaranteed that they tested, didn't like the results, and didn't publish.

I'm still surprised it took this long for academics to try it, and skimming the citations, I don't see anything similar. Anyone know if this is the first paper to try this kind of thing, or just the first to put together an especially good suite of reusable benchies?

If this benchmark becomes popular, then presumably, to avoid such embarrassments, synthetic data will eventually be added to training sets to make sure even esolangs are somewhat more in-distro, and then we gradually run out of esolangs to do honest testing with. SAT solving is a whole different animal, admittedly, but a comparable honest test might involve forcing models to conform to a randomly generated but easily checked EBNF grammar. I don't have a quick link to the relevant papers, but afaik benchmarks of strict adherence to non-simple JSON schemas are also still pretty bad, and we're just working around it with lots of retries/tokens. "But look how well it works for 10k lines of kubernetes manifests!" Well yeah, maybe, but that barely needs to really follow a schema, since it's more stuff that's already in the training set.
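To sketch what "randomly generated but easily checked" could look like: generate a random right-linear grammar (a restricted flavor of EBNF), derive reference strings from it, and check adherence by NFA-style simulation. Everything here (function names, grammar shape, seeds) is my own illustration, not from the paper:

```python
import random


def random_grammar(nonterminals=("S", "A", "B"), terminals="ab", seed=42):
    """Build a random right-linear grammar: NT -> terminal NT | terminal."""
    rng = random.Random(seed)
    rules = {nt: [] for nt in nonterminals}
    for nt in nonterminals:
        # Two continuing productions plus one terminating production,
        # so every nonterminal can eventually finish a derivation.
        for _ in range(2):
            rules[nt].append((rng.choice(terminals), rng.choice(nonterminals)))
        rules[nt].append((rng.choice(terminals), None))
    return rules


def accepts(rules, s, start="S"):
    """Check membership by tracking the set of live nonterminals (NFA run)."""
    live = {start}
    for i, ch in enumerate(s):
        last = i == len(s) - 1
        nxt = set()
        for nt in live:
            for term, cont in rules[nt]:
                if term != ch:
                    continue
                if cont is not None:
                    nxt.add(cont)
                elif last:
                    return True  # a terminating production consumed the final char
        live = nxt
    return False


def sample(rules, start="S", seed=0):
    """Derive one string from the grammar; by construction it is accepted."""
    rng = random.Random(seed)
    out, nt = [], start
    while nt is not None:
        term, nt = rng.choice(rules[nt])
        out.append(term)
    return "".join(out)
```

The nice property is asymmetry: generating a fresh grammar is cheap and unmemorizable, while checking a model's output against it is exact and instant, so there's no judge-model fuzziness.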

dinp 28 minutes ago | parent | next [-]

> If this benchmark becomes popular, then presumably to avoid such embarrassments synthetic data is eventually added to training sets to make sure even esolangs are somewhat more in-distro

https://x.com/lossfunk/status/2034637505916792886

"After the paper was finalized, we ran agentic systems that mimic how humans would learn to solve problems in esoteric languages. We supplied our agents with a custom harness + tools on the same benchmark. They absolutely crushed the benchmark. Stay tuned"

A little harness engineering was enough!

GorbachevyChase an hour ago | parent | prev [-]

I don’t have much confidence in the premise. Where was the human control? I think most Python programmers, when tasked with “now do it in brainfuck,” would fail. There is not much meaningful overlap in how the two languages express intent and solutions to problems. The ridiculous syntax is the joke.

But more importantly, I don’t have to solve any problems with languages that are elaborate practical jokes, so I’m not worried about the implications for an LLM’s ability to be useful.

culi an hour ago | parent | next [-]

The point here is to test for "genuine reasoning," or something approaching it. If a model is truly reasoning, it should be competent even in a new language you just made up (provided the language itself is competently designed).

wehnsdaefflae 35 minutes ago | parent [-]

So humans don't do "genuine reasoning"?

dvt an hour ago | parent | prev [-]

> I don’t have to solve any problems with languages that are elaborate practical jokes

This is just being needlessly dismissive. Esolangs are (and have been) an area of active CS research for decades. I know I'm a bit of an esolang nerd, and while some are jokes, most focus on specific paradigms (e.g. Piet is visual, bf is a Turing tarpit, etc.).

> I think most Python programmers when tasked with “now do it in brainfuck” would fail.

This is untrue. Given internet-level awareness and infinite time, virtually all developers should be able to go from Python to brainfuck (trivially, I might add). Did you even look at the test sets? It's all pretty basic stuff (palindromes, array traversal, etc.; we aren't using pandas here). Sure, it would take forever and be mega annoying, but manipulating a head and some tape is hardly difficult.
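To make "a head and some tape" concrete: the whole execution model fits in a few dozen lines. A minimal interpreter sketch (the 30k-cell tape, 8-bit wrapping, and EOF-reads-as-zero are my assumptions; dialects vary on all three):

```python
def bf(code, stdin=""):
    """Minimal brainfuck interpreter: a tape, a head, eight commands."""
    # Precompute matching brackets so loops are O(1) jumps.
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape = [0] * 30000          # assumed tape size (classic convention)
    head = pc = inp = 0
    out = []
    while pc < len(code):
        c = code[pc]
        if c == ">":
            head += 1
        elif c == "<":
            head -= 1
        elif c == "+":
            tape[head] = (tape[head] + 1) % 256   # assumed 8-bit wrapping cells
        elif c == "-":
            tape[head] = (tape[head] - 1) % 256
        elif c == ".":
            out.append(chr(tape[head]))
        elif c == ",":
            tape[head] = ord(stdin[inp]) if inp < len(stdin) else 0  # EOF -> 0
            inp += 1
        elif c == "[" and tape[head] == 0:
            pc = jumps[pc]      # skip loop body
        elif c == "]" and tape[head] != 0:
            pc = jumps[pc]      # repeat loop body
        pc += 1
    return "".join(out)


# e.g. 8*8+1 = 65 -> 'A'
print(bf("++++++++[>++++++++<-]>+."))
```

The point being: the *semantics* are trivial; what's painful is the encoding effort, which is exactly the kind of mechanical grind that time plus a reference page solves for a human.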