dinp 2 hours ago

> If this benchmark becomes popular, then presumably to avoid such embarrassments synthetic data is eventually added to training sets to make sure even esolangs are somewhat more in-distro

https://x.com/lossfunk/status/2034637505916792886

"After the paper was finalized, we ran agentic systems that mimic how humans would learn to solve problems in esoteric languages. We supplied our agents with a custom harness + tools on the same benchmark. They absolutely crushed the benchmark. Stay tuned"

A little harness engineering was enough!