orthoxerox 7 hours ago

> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.

I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

Or does this simply show that esolangs are hard to reason in by design? A more honest approach would use a "real", but relatively unpopular, language. Make them use CoffeeScript or Ada or PL/I or Odin or that other systems programming language that that very opinionated guy is implementing on top of QBE.

onoesworkacct 6 hours ago | parent | next [-]

Unlike AI, you aren't able to regurgitate entire programs and patterns you've seen before.

AI's capacity for memorisation is unrivaled. I find it mind-blowing that you can download a tiny ~4 GB model and it will have vastly more general knowledge than an average human (considering that the human is more likely to be wrong if you ask them trivia about, e.g., the Spanish Civil War).

But the average human still has actual reasoning capabilities, which is still (I think?) a debated point with AI.

refulgentis 5 hours ago | parent [-]

> which is still (I think?) a debated point with AI.

It's not. People misread an Apple study and it became a meme. It lost currency as a meme because it is impossible to use a model in 2026 and come away with the idea it cannot reason, for any reasonable definition of the word reason (pun intended). Most of the debate from there is just people misreading each other and imagining incentive structures at play. (To be clear, I am not claiming they are never stupid, ex. the car wash dilemma, but I am claiming they're gee-whiz enough, often enough, that it's become de facto beyond honest debate.)

> AI's capacity for memorisation is unrivaled,

Much like "it just memorizes training data", "memorization" has a kernel of truth to it. But memorizing does not imply that it has 100% "learned" Brainfuck (for some definition of "learned" like "guaranteed, reproducibly correct computation") to the point that writing it is as easy as writing any other program, and thus that if it hasn't, it cannot reason.

At the end of the day these are just mathematical objects. And while it's not discourse-contributing, the mundane truth is: those matmuls born from boring curve-fitting at scale know/memorized/can reason about/can parrot/have adjusted their float32s in such a way that they produce C a lot better than Brainfuck. Much like us. But they're just matmuls curve-fitting at scale.

qsera 2 hours ago | parent [-]

> and come away with the idea it cannot reason

Reasoning and the "appearance" of reasoning are two different things. Some people intrinsically understand this. And some do not, and those people can never be made to understand it. I think it is one of those things that you either get automatically, or not at all.

GorbachevyChase an hour ago | parent [-]

So does a human engaged in rationalization or confabulation just appear to reason? We might be closer to these machines than you think, and I don’t mean that in a positive way.

NateEag 16 minutes ago | parent [-]

Not OP, but as an LLM skeptic, I'd absolutely say that humans are natively very poor reasoners.

With effort, support, and resources, we can learn to reason well from first principles - call it reaching "intellectual maturity."

Catch an emotionally-immature human in a mistake or conflicting set of beliefs, and you'll be able to see them do exactly what you describe above: rationalize, deflect, and twist the data to support a more emotionally-comfortable narrative.

That usually holds even for intellectually-mature individuals who have not yet matured emotionally, even though they may reason quite well when the stakes are low.

Humans that have matured both emotionally and intellectually, however, are often able to keep themselves stable and reason well even in difficult circumstances.

The ways LLMs consistently fail spectacularly on out-of-distribution problems (like these esolangs) do seem to suggest they don't really mature intellectually, not the way humans can.

Maybe the Wiggum loop strategy shows otherwise? I'm not sure I know.

To me, it smells more like brute-forcing through to a result without fully understanding the problem, though.

IsTom 7 hours ago | parent | prev | next [-]

Just look at what kind of problems the easy task set is (hello world, echo a line, count vowels, etc.). With the best result being ~10% of the total in Brainfuck, that's 10 out of 20. You can Google more solutions to these problems than that.
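
For scale, those easy tasks really do reduce to a handful of idioms you can find verbatim online. Here's a rough sketch (a minimal Brainfuck interpreter in Python, my own toy, not the benchmark's harness; it uses the common EOF-reads-as-0 convention) running the canonical echo program:

```python
def run_bf(code, stdin=b""):
    """Interpret Brainfuck `code`, returning output as bytes."""
    tape = [0] * 30000           # conventional 30k-cell tape
    ptr = 0                      # data pointer
    inp = iter(stdin)
    out = bytearray()
    # Precompute matching bracket positions for loop jumps.
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    pc = 0
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(tape[ptr])
        elif c == ",":
            tape[ptr] = next(inp, 0)   # EOF reads as 0
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]             # skip loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]             # repeat loop body
        pc += 1
    return bytes(out)

# "Echo line" is a five-character idiom: read a byte, then loop
# printing and re-reading until input is exhausted (reads 0).
print(run_bf(",[.,]", b"hello"))  # b'hello'
```

The point being: solving these particular tasks says very little, since the entire solution is a widely published snippet.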

voxl 6 hours ago | parent [-]

It's pointless to argue; we exist in a world of "this technology will usher in the singularity" versus "this tech is useful, but come on".

The singularity crowd has never listened to reason and never will.

andai 7 hours ago | parent | prev | next [-]

Yeah there seem to be two axes here.

Esolang vs mainstream paradigm.

Popular vs scarce training data.

So you'd want to control for training data (e.g. Brainfuck vs. Odin?).

And ideally you'd control by getting it down to 0, i.e. inventing new programming languages with various properties and testing the LLMs on those.

I think that would be a useful benchmark for other reasons. It would measure the LLMs' ability to "learn" on the spot. From what I understand, this remains an underdeveloped area of their intelligence. (And may not be solvable with current architectures.)

astrange 6 hours ago | parent | prev | next [-]

> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

It doesn't even prove the models do that. The RLVR environments being mostly Python isn't "training data memorization". That's just the kind of dumb thing people say to sound savvy.

iloveoof 7 hours ago | parent | prev | next [-]

Try MUMPS: widely used, but with little training data online. Probably less than some esolangs.

twoodfin 6 hours ago | parent [-]

Frontier models have gotten much better at ObjectScript (the InterSystems evolution of MUMPS/M).

Palindrome:

https://chatgpt.com/s/t_69bc8d8c116c8191a339a33f0fbcc935

This is a noticeable improvement from a year ago.

I wish it would use Return instead of Quit but that’s a stochastic parrot for you.

wavemode 7 hours ago | parent | prev | next [-]

> I would probably score about the same, does this prove I also rely on training data memorization rather than genuine programming reasoning?

Setting aside whether this benchmark is meaningful or not - the argument you're making is faulty. There are indeed humans who can write complete programs in Brainfuck and these other esolangs. The fact that you personally can't is not logically relevant.

Groxx 7 hours ago | parent [-]

particularly if you'd already read approximately all written material in existence about those languages. many humans are capable of learning a language from the documentation.

Pamar 4 hours ago | parent | prev | next [-]

I had similar experiences with an unpopular but not "esoteric" language (Progress ABL) and so did some other developers in my team.

derrak 3 hours ago | parent | prev [-]

I don’t know your background, but I suspect that if you were given sufficient motivation, you could solve these problems in an esoteric language. It might be tedious, but I suspect that almost anyone with an undergraduate degree in computer science and sufficient experience in a couple of programming languages could meet the task.