It's pretty clear that any benchmark that comes out will be outdated and end up in the training data in short order. There will always be an incentive to optimize specifically for these benchmarks, even if just for marketing material. Sure, there is a training cutoff, but it's usually only 3-6 months before the public release date.

The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed not to be in the training data already and that borrow nothing from previous benchmarks.

In this regard I don't think any benchmark created before a given model is released should be considered valid or representative of model performance. The potential financial gain from including the data just to be able to market a minor improvement is too tempting. With that in mind, they should honestly just stop including benchmarks in marketing material altogether.

Let the model speak for itself and let the community decide, but of course that will never fly with corporate types with so much money on the line.

▲

mnky9800n 10 hours ago | parent | next [-]

This is why I made Zork bench. Zork, the text adventure game, is in the training data for LLMs. It’s also deterministic. Therefore it should be easy for an LLM to play and complete. Yet they don’t. Understanding why is the goal of Zork bench.

https://github.com/mnky9800n/zork-bench

▲

kqr 9 hours ago | parent | next [-]

I have worked on similar problems. See e.g. [1].

The LLMs I have tested have terrible world models and poor intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary.

[1]: https://entropicthoughts.com/updated-llm-benchmark

(more descriptions available in earlier evaluations referenced from there)

▲

malfist 7 hours ago | parent | next [-]

I'm going to ignore all that and tell my developers working in complicated codebases that they have to use AI. I'm sure comprehending side effects in a world-building text adventure is completely different than understanding spaghetti code.

▲

red75prime 6 hours ago | parent [-]

Desarcasmed version: "I think that problems with Zork make those models virtually useless in programming tasks." Correct?

▲

cptskippy 2 hours ago | parent [-]

He said complicated codebases. LLMs are great at producing small snippets of code to address very targeted problems.

▲

seanmcdirmid 6 hours ago | parent | prev | next [-]

You can code your prompts to read and write an external world model on the side. This is what most people do who are seriously doing games with LLMs.
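
A minimal sketch of what "an external world model on the side" could look like, assuming the game loop (not the LLM) owns the state and re-injects it into every prompt. The `WorldModel` class, the update keys (`add_items`, `remove_items`, `facts`), and the prompt layout are all hypothetical names for illustration, not taken from any particular framework.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WorldModel:
    """Hypothetical side-channel state kept outside the model's context."""
    location: str = "unknown"
    inventory: List[str] = field(default_factory=list)
    facts: Dict[str, str] = field(default_factory=dict)

    def apply(self, update: dict) -> None:
        """Merge a structured update, e.g. one parsed from the model's reply."""
        if "location" in update:
            self.location = update["location"]
        for item in update.get("add_items", []):
            if item not in self.inventory:
                self.inventory.append(item)
        for item in update.get("remove_items", []):
            if item in self.inventory:
                self.inventory.remove(item)
        self.facts.update(update.get("facts", {}))

    def to_prompt(self) -> str:
        """Serialize the state so it can be prepended to every prompt."""
        state = {
            "location": self.location,
            "inventory": self.inventory,
            "facts": self.facts,
        }
        return "KNOWN STATE:\n" + json.dumps(state, indent=2, sort_keys=True)


def build_prompt(world: WorldModel, observation: str) -> str:
    """Combine the persisted state with the latest game output."""
    return f"{world.to_prompt()}\n\nOBSERVATION:\n{observation}\n\nNext action?"
```

The point of the design is that the model never has to remember what it is carrying or where it is; it only has to read the state block and propose an update, which the surrounding code validates and stores.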

▲

mnky9800n 8 hours ago | parent | prev [-]

we should talk. i sent you an email.

▲

WarmWash 9 hours ago | parent | prev | next [-]

The open models only give the SOTA models a run for their money on gameable benchmarks. On the semi-private ARC-AGI 2 sets they do absolutely awfully (<10% while SOTA is at ~80%)

It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.

▲

roenxi 8 hours ago | parent [-]

Have the open models been tried? When I look at the leaderboard [0], the only Qwen model I see is 235B-A22B. I wouldn't expect an MoE model to do particularly well. From what I've seen (thinking mainly of a leaderboard that tries to measure EQ [1]), MoE models are at a distinct disadvantage to regular models when it comes to complex tasks that aren't software benchmark targets.

[0] https://arcprize.org/leaderboard

[1] https://eqbench.com/index.html

▲

WarmWash 7 hours ago | parent [-]

There is GLM 5 and kimi 2.5 (which gets 11.8%, but I digress).

▲

CamperBob2 8 hours ago | parent | prev | next [-]

Actually the Zorks weren't deterministic, especially Zork II. The Wizard could F you over pretty badly if he appeared at an inopportune time.

▲

cbg0 8 hours ago | parent | prev | next [-]

> let the community decide

Which community are we talking about? The professionals with 10+ years of experience who use LLMs, the vibe coders with no experience writing code, or everyone in between? If you read some of the online communities, the experiences with the models are all over the place: some compare GPT 5.5 to the second coming of JC while others think it's stupider than 5.4.

I personally don't have time to build a set of private benchmarks to compare the models that are coming out so I'm mostly relying on private and semi-private benchmarks to get a feel for how models are improving before I subscribe to a service and start using it myself. At least it's something a bit more reliable than the vibes of random people and bots on reddit.

▲

trueno 4 hours ago | parent [-]

yea lol i think the community on this one is woefully unqualified to call any shots here. the goalposts are basically teleporting and everyone's equating success with the success of their own incredibly vague, personally created, non-deterministic agentic workflows. there's like no real answers coming from "the community" in this space at the moment, it's vividly similar to cryptocurrency cycles. most importantly, like you say, vibe coders are going to be the largest subset of the community and probably the most unqualified to assess performance because they're mostly clueless about how things work under the hood.

▲

WarmWash 9 hours ago | parent | prev | next [-]

An easy way to make coding benchmarks viable again is to initialize the models with 200k of distracting or unrelated tokens in their context. Or even just run the tests sequentially in the same context and see how far the model gets before it unwinds.

These benchmarks are always greenfield, but people want a model that can deal with a rotted context.
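
A rough sketch of the padding idea, assuming you already have a pool of unrelated text to draw from. Token counts are approximated by whitespace-separated words, which is only a stand-in for a real tokenizer, and `build_rotted_prompt` is a hypothetical helper name.

```python
import random
from typing import List


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def build_rotted_prompt(task: str, distractors: List[str],
                        budget_tokens: int, seed: int = 0) -> str:
    """Prepend roughly `budget_tokens` of unrelated text to the real task."""
    usable = [d for d in distractors if d.split()]  # skip empty chunks
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    filler = []
    used = 0
    while usable and used < budget_tokens:
        words = rng.choice(usable).split()
        words = words[: budget_tokens - used]  # trim the last chunk to fit
        filler.append(" ".join(words))
        used += len(words)
    # The actual benchmark task always comes last, after the noise.
    return "\n\n".join(filler + [task])
```

Running the same task at several budgets (say 0, 50k, 200k) would give a degradation curve per model instead of a single greenfield score.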

▲

adamandsteve 8 hours ago | parent | prev | next [-]

"The community" is astroturfed as hell though. Anthropic pays influencers to promote Claude Code and likely bots a ton as well, so it's hard to come to any kind of consensus online. Even if everyone was acting in good faith, some people will have a much better experience than others because of the domain they're working in (e.g. AI being much better at frontend and commonly used libraries).

The only real way to evaluate a model is to test it yourself but that's exhausting for each new model and not comprehensive anyway.

▲

InsideOutSanta 7 hours ago | parent [-]

Yeah, it's crazy that there is no trustworthy source for model reviews. I'd love to know how well the new Deepseek 4 actually performs, for example, but I don't want to spend the next week testing it out. Reddit used to be a somewhat useful gauge, but now there are posts on how 4 is useless right next to posts on how amazing it is. And I have no idea if this is astroturfing, or somebody using a quantized version, or different workloads, or what.

I also find it increasingly difficult to evaluate the models I actually do use. Sometimes each new release seems identical or only marginally better than the previous version, but when I then go back two or three versions, I suddenly find that older model to be dramatically worse. But was that older model always that quality, or am I now being served a different model under the same version name?

It's all just so opaque.

▲

rhdunn 7 hours ago | parent [-]

One challenge is that model evaluation is typically domain/application specific. Model performance can also depend on the system prompt and the input/context.

Regarding evaluation, I've found tools like promptfoo (and in some cases custom tools built on top of it) useful. These help when evaluating new models/versions and when modifying the system prompt to guide the model, especially if you can define visualizations and assertions that accurately test what you are trying to achieve.

This can be difficult for tasks like summarization, code generation, or creative writing that don't have clear answers. That said, having some basic evaluation metrics and test cases can still be useful, as can being able to easily do side-by-side comparisons by hand.

▲

jvuygbbkuurx 10 hours ago | parent | prev | next [-]

I think the solution is a bunch of private trusted benchmarks, and averaging their announced results.

▲

zephen 7 hours ago | parent [-]

> averaging their announced results.

Obligatory XKCD: https://xkcd.com/937/

▲

Escapado 8 hours ago | parent | prev | next [-]

I agree with the sentiment, but if a sufficiently large number of sufficiently sophisticated benchmarks existed, I would be surprised if a model could merely memorize those benchmarks while showing terrible real-world performance. We are not there yet, but maybe one day we will be.

▲

AntiUSAbah 7 hours ago | parent | prev | next [-]

On the contrary: in an interview, someone from OpenAI said they try to avoid training on benchmark data because it makes it harder for them to determine whether a model is getting better.

▲

thesz 3 hours ago | parent [-]

Perturbing the dataset used for training can introduce adversarial behavior even without adding any other data, and the idea is quite simple: you take two batches from the training dataset and select the model with the more probable adversarial behavior. The more batches are processed with this posterior selection, the more probable adversarial behavior becomes. By determining whether a model gets better or not on a given benchmark, OpenAI selects models against benchmarks, implicitly using them in training.
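
The selection effect in the last sentence can be illustrated with a toy simulation. This is a deliberately simplified stand-in (plain selection bias, not the batch-level adversarial setup described above): every candidate has the same true skill, the best public score is kept, and the winner is then re-scored on fresh items. All parameter values and the `simulate_selection` name are assumptions chosen for illustration.

```python
import random
from typing import Tuple


def simulate_selection(true_skill: float = 0.6, n_items: int = 100,
                       n_candidates: int = 20, trials: int = 200,
                       seed: int = 0) -> Tuple[float, float]:
    """Return (mean public score of the selected model, mean fresh score)."""
    rng = random.Random(seed)

    def score() -> float:
        # One benchmark run: each item is solved with probability true_skill.
        return sum(rng.random() < true_skill for _ in range(n_items)) / n_items

    public_total = 0.0
    fresh_total = 0.0
    for _ in range(trials):
        # All candidates are equally good; only benchmark noise differs.
        public_scores = [score() for _ in range(n_candidates)]
        public_total += max(public_scores)  # keep whichever scored best
        fresh_total += score()              # re-test the winner on unseen items
    return public_total / trials, fresh_total / trials
```

Even though no benchmark item is ever trained on, the selected model's public score comes out above its true skill, while the fresh score stays near it. That gap is the benchmark leaking into the model purely through the keep-or-discard decision.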

▲

MattRix 9 hours ago | parent | prev | next [-]

They mention this in the article. This is why private (non-public) benchmark tasks made from scratch are necessary.

▲

cyanydeez 9 hours ago | parent | prev [-]

a good benchmark would probably be porting a selected repo to another language, then clearing the context and notes and having it port the repo back.

as long as there's a test framework, you could gauge success deterministically.
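
one way the scoring could work, as a sketch: run the repo's test suite before the round trip and again after, then count how many of the originally passing tests still pass. the `round_trip_score` name and the dict-of-booleans result format are assumptions; how you actually collect results depends on the test framework.

```python
from typing import Dict


def round_trip_score(baseline: Dict[str, bool], after: Dict[str, bool]) -> float:
    """Fraction of originally passing tests that still pass after the round trip."""
    passing = [name for name, ok in baseline.items() if ok]
    if not passing:
        raise ValueError("baseline has no passing tests to compare against")
    # A test missing from the post-port results counts as a failure.
    kept = sum(1 for name in passing if after.get(name, False))
    return kept / len(passing)
```

tests that failed in the original repo are ignored, so the model is not rewarded or punished for behavior that was already broken.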