paraschopra 3 hours ago

(founder of Lossfunk, the lab behind this research.)

Esolang-Bench went viral on X and a lot of discussion ensued. Here we address a few common questions that came up about the benchmark. Hope it helps.

a) Why do it? Does it measure anything useful?

It was a curiosity-driven project. We're interested in how humans exhibit sample efficiency in learning and OOD generalization. So we simply asked: if models can produce correct zero/few-shot answers for simple programming problems in Python, can they do the same in esoteric languages?

The benchmark is what it is. Different people can interpret its usefulness differently, and we encourage that.

b) But humans can't write esoteric languages well either. It's an unfair comparison.

Primarily, we're interested in measuring LLM capabilities. With all the talk of ASI, these capabilities are supposed to soon be super-human. So our primary motivation wasn't to compare against humans, but to check what models can do on this by-construction difficult benchmark.

However, we do believe that humans are able to teach themselves a new domain by transferring their existing skills. This benchmark sets a starting point for exploring whether AI systems can do the same (which is what we're exploring now).

c) But Claude Code crushes it. You limited models artificially.

Yes, we tested models in zero- and few-shot settings, and in the agentic loop we describe in the paper we limit the number of iterations. As noted above, we wanted to understand their performance from a comparative point of view (say, against highly represented languages like Python), and that's why the benchmark is designed this way.

After the paper was finalized, we experimented with agentic systems where we gave models tools like bash and allowed unlimited iterations (but limited submission attempts). They indeed perform much better.
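To make the setup concrete, here is a minimal sketch of such a loop: the agent may iterate freely (e.g. run code and read errors), but scoring submissions are capped. All names here (`run_agentic_eval`, `agent_step`, `check_submission`) are illustrative assumptions, not the paper's actual harness.

```python
def run_agentic_eval(agent_step, check_submission, max_submissions=3):
    """Run an agent with unlimited iterations but a capped submission budget.

    agent_step(feedback) -> ("iterate", info) or ("submit", program).
    check_submission(program) -> (passed: bool, feedback: str).
    Returns (solved, submissions_used).
    """
    feedback = None
    submissions_left = max_submissions
    while submissions_left > 0:
        action, payload = agent_step(feedback)
        if action == "submit":
            submissions_left -= 1
            passed, feedback = check_submission(payload)
            if passed:
                return True, max_submissions - submissions_left
        else:
            # Free iteration: feedback could be interpreter/bash output.
            feedback = payload
    return False, max_submissions

# Toy usage: an "agent" that gets it right on its second try.
attempts = {"n": 0}

def toy_agent(feedback):
    attempts["n"] += 1
    return ("submit", "right" if attempts["n"] >= 2 else "wrong")

def toy_checker(program):
    return (program == "right", "" if program == "right" else "tests failed")

solved, used = run_agentic_eval(toy_agent, toy_checker)
print(solved, used)  # -> True 2
```

The design choice worth noting is that the cost being limited is verification (submissions), not exploration (iterations), which is what lets the model debug its way to a solution.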

The relevant question is what makes these models perform so well when you give them tools and iterations vs. when you don't. Are they reasoning and learning like humans, or is it something else?

d) So, are LLMs hyped? Or is our study clickbait?

The paper, code and benchmark are all open source.

We encourage whoever is interested to read it, and make up their own minds.

(We couldn't help noticing that the same set of results was interpreted wildly differently within the community; a debate between opposing camps on LLMs ensued. Perhaps that's a good thing?)