It's an important question! If you are paying a lot of money to use AI models, you care that you are using the best one for your task. And it turns out that figuring out which AI model is best for your task is not trivial and requires some expertise.

▲

liveoneggs 22 minutes ago | parent | next [-]

They all change day to day and are non-deterministic by design. Your settled answer is only good for a moment.

▲

wseqyrku 3 hours ago | parent | prev | next [-]

That was too nice of a reply, I apologize. I just can't understand the thought process: what exactly are we optimizing for? If you are paying a lot of money to use AI models, you already have so much overhead that a precise ranking in an eval is not gonna make much difference between equally "frontier" models, especially since models are sensitive to the input. So the eval is just gonna evaluate the eval with very high accuracy. It might be the equivalent of the illusion of safety in financial risk management.

	▲	thomasliao 3 hours ago | parent | next [-]
		> equally "frontier" models
		A key point I want to make is that the notion of "frontier" is somewhat fictive, in the sense that a model which dominates all others on a given eval is not guaranteed to be number one on another eval, even if both evals are ostensibly for the same task.
		For example, the best publicly-available model (i.e. excluding Claude Mythos and Fable) on DeepSWE[0] is gpt-5.5-xhigh at 67%, which is soundly better than claude-opus-4.8-max at 59%. I would say an 8pp gap on a benchmark is quite large. But on FrontierCode[1], claude-opus-4.8-xhigh is the best, at a score of 13.4% compared to gpt-5.5-medium at 6.3%. That's quite a significant reversal!
		Now, one might wish to claim that either DeepSWE or FrontierCode is poorly constructed and that the other is more accurate. But I think you'll find that the degree to which eval-design considerations affect measurement in this case is probably no smaller than the degree to which user-specific considerations affect measurement in general.
		[0] https://deepswe.datacurve.ai/
		[1] https://cognition.com/blog/frontier-code
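To make the reversal concrete, here is a minimal sketch that ranks the models using only the scores quoted above. The `family` helper and the idea of collapsing configuration suffixes such as `-xhigh` or `-max` into one model family are assumptions made for illustration; the quoted configurations are not identical across the two benchmarks, so this is a rough comparison rather than a controlled one.

```python
# Scores (percent) exactly as quoted in the comment above.
# Note: the configurations differ between benchmarks (e.g. -max vs -xhigh).
scores = {
    "DeepSWE": {"gpt-5.5-xhigh": 67.0, "claude-opus-4.8-max": 59.0},
    "FrontierCode": {"claude-opus-4.8-xhigh": 13.4, "gpt-5.5-medium": 6.3},
}


def family(name: str) -> str:
    """Hypothetical helper: drop the trailing effort suffix from a model name."""
    return name.rsplit("-", 1)[0]


def leader_and_gap(results: dict) -> tuple:
    """Return (leading model family, lead over the runner-up in points)."""
    ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    return family(top), round(top_score - runner_up_score, 1)


summary = {bench: leader_and_gap(res) for bench, res in scores.items()}
for bench, (leader, gap) in summary.items():
    print(f"{bench}: {leader} leads by {gap}pp")
```

The leader differs between the two benchmarks, which is the point: "best" depends on which eval you ask.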
	▲	unchar1 2 hours ago | parent | prev | next [-]
		It's not just figuring out whether a model is good at things, but whether it's good at the things I care about. Using a targeted eval suite (like a test suite) tells us that.
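A targeted eval suite can look a lot like a unit-test file. The sketch below is one possible shape, not any particular framework's API: `pass_rate`, `make_fake_model`, and the two toy cases are all hypothetical, and the fake model stands in for a real API call. Each case runs several times because outputs are non-deterministic, so a single pass or fail says little.

```python
import random
from typing import Callable, List, Tuple

# Each case pairs a prompt with a checker, much like a unit test.
EvalCase = Tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: List[EvalCase], trials: int = 5) -> float:
    """Run every case `trials` times and return the fraction of passing runs."""
    passed = 0
    for prompt, check in cases:
        for _ in range(trials):
            passed += bool(check(model(prompt)))
    return passed / (len(cases) * trials)


def make_fake_model(error_rate: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model call: answers wrongly at `error_rate`."""
    rng = random.Random(seed)
    answers = {"2+2": "4", "capital of France": "Paris"}

    def model(prompt: str) -> str:
        return "?" if rng.random() < error_rate else answers[prompt]

    return model


# Replace these toy cases with the tasks you actually care about.
cases: List[EvalCase] = [
    ("2+2", lambda out: out.strip() == "4"),
    ("capital of France", lambda out: "Paris" in out),
]

if __name__ == "__main__":
    for rate in (0.0, 0.3):
        print(rate, pass_rate(make_fake_model(rate, seed=0), cases))
```

Swapping the fake model for two real candidates (or two prompt variants) and comparing their pass rates on your own cases is the whole idea.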
	▲	moomin 3 hours ago | parent | prev [-]
		It's not just for choosing a model; you can use it for your prompting as well (basically anything to do with your setup). And yes, running evals is expensive and mostly of use to people with serious spend.

▲

lupire an hour ago | parent | prev [-]

But frontier models are constantly changing.