robot-wrangler 4 hours ago
> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training-data memorization rather than genuine programming reasoning.

Finally! This is a really obvious test case that I've wondered about myself, and that many casual skeptics and cautiously optimistic people have been independently raising for several years now. When megacorp is not crowing about such a test, the silence is deafening; it was practically guaranteed that they ran it, didn't like the results, and didn't publish. I'm still surprised it took this long for academics to try it, and skimming the cites, I don't see anything similar. Anyone know if this is the first paper to try this kind of thing, or just the first to put together an especially good suite of reusable benchies?

If this benchmark becomes popular, then presumably, to avoid such embarrassments, synthetic data eventually gets added to training sets to make sure even esolangs are somewhat more in-distro, and then we gradually run out of esolangs to do honest testing with. SAT is a whole different animal admittedly, but a comparable honest test might involve just forcing models to follow a randomly generated but easily checked EBNF grammar.

I don't have a quick link to the relevant papers, but AFAIK benchmarks of strict adherence to non-simple JSON schemas are also still pretty bad, and we're just working around it with lots of retries/tokens. "But look how well it works for 10k lines of Kubernetes manifests!" Well yeah, maybe, but it barely needs to really follow a schema, since that's more stuff that's in the training set.
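To make the "randomly generated but easily checked grammar" idea concrete, here's a toy sketch of what such a test harness could look like. Everything here is hypothetical and illustrative, not from the paper: a seeded generator produces a tiny acyclic grammar (each rule expands only to strictly later rules, so derivation always terminates), and a recursive-descent checker verifies whether a model's output conforms. The point is that conformance checking is cheap and mechanical even when the grammar itself is guaranteed to be out-of-distribution.

```python
import random

def make_grammar(seed, n_rules=4):
    """Generate a tiny random grammar: rule i -> list of alternatives,
    each alternative a sequence of terminal chars and nonterminal ints.
    Nonterminals only reference strictly later rules, so it's acyclic.
    Purely illustrative -- a real harness would emit EBNF text for the
    model's prompt alongside this internal form."""
    rng = random.Random(seed)
    terminals = "abcxyz"
    grammar = {}
    for i in range(n_rules):
        alts = []
        for _ in range(2):  # two alternatives per rule
            alt = []
            for _ in range(rng.randint(1, 3)):
                if i < n_rules - 1 and rng.random() < 0.4:
                    alt.append(rng.randint(i + 1, n_rules - 1))  # nonterminal
                else:
                    alt.append(rng.choice(terminals))  # terminal
            alts.append(alt)
        grammar[i] = alts
    return grammar

def sample(grammar, rng, rule=0):
    """Derive one random string from the grammar (a known-good reference)."""
    alt = rng.choice(grammar[rule])
    return "".join(
        sample(grammar, rng, sym) if isinstance(sym, int) else sym
        for sym in alt
    )

def _matches(grammar, rule, s, pos):
    """Return the set of end positions reachable by matching `rule`
    against s starting at pos (exhaustive recursive descent; fine for
    tiny grammars, exponential in general)."""
    ends = set()
    for alt in grammar[rule]:
        positions = {pos}
        for sym in alt:
            nxt = set()
            for p in positions:
                if isinstance(sym, int):
                    nxt |= _matches(grammar, sym, s, p)
                elif p < len(s) and s[p] == sym:
                    nxt.add(p + 1)
            positions = nxt
        ends |= positions
    return ends

def conforms(grammar, s):
    """True iff the whole string derives from the start rule."""
    return len(s) in _matches(grammar, 0, s, 0)
```

Usage would be: generate a fresh seed per test item, show the model the grammar, and score its output with `conforms` — no reference solutions needed, and the grammar is trivially novel every run.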
dinp 28 minutes ago
> If this benchmark becomes popular, then presumably to avoid such embarrassments synthetic data is eventually added to training sets to make sure even esolangs are somewhat more in-distro

https://x.com/lossfunk/status/2034637505916792886

> "After the paper was finalized, we ran agentic systems that mimic how humans would learn to solve problems in esoteric languages. We supplied our agents with a custom harness + tools on the same benchmark. They absolutely crushed the benchmark. Stay tuned"

A little harness engineering was enough!
GorbachevyChase an hour ago
I don’t have much confidence in the premise. Where was the human control? I think most Python programmers, when tasked with “now do it in Brainfuck,” would fail. There is not much meaningful overlap in how the two languages express intent and solutions to problems; the ridiculous syntax is the joke. But more importantly, I don’t have to solve any problems with languages that are elaborate practical jokes, so I’m not worried about the implications for an LLM’s ability to be useful.