theteapot 4 hours ago

What's an eval?

choult 4 hours ago | parent | next [-]

Evaluations of different implementations of a tech. Kind of like a meta service layer on top of an industry, such as "Which frontier model is best?"

I do agree that the author does not do a good job of introducing the term.

wseqyrku 4 hours ago | parent [-]

"Which frontier model is best?"

What kind of stupid business is this. Though nothing can beat SEO in that spirit.

thomasliao 3 hours ago | parent [-]

It's an important question! If you are paying a lot of money to use AI models, you care that you are using the best one for your task. And it turns out that figuring out which AI model is best for your task is not trivial and requires some expertise.

liveoneggs 27 minutes ago | parent | next [-]

They all change day to day and are non-deterministic by design. Your settled answer is only good for a moment.

wseqyrku 3 hours ago | parent | prev | next [-]

That was too nice of a reply, I apologize. I just can't understand the thought process: what exactly are we optimizing for? If you are paying a lot of money to use AI models, you already have so much overhead that a precise ranking in an eval is not gonna make much difference between equally "frontier" models, especially since models are sensitive to the input. So the eval is just gonna evaluate the eval with very high accuracy. It might be equivalent to the illusion-of-safety thing applied to financial risk.

thomasliao 3 hours ago | parent | next [-]

>equally "frontier" models

A key point I want to make is that the notion of "frontier" is somewhat fictive in the sense that a model which dominates all others on a given eval is not guaranteed to be number one on another eval, even if both evals are ostensibly for the same task.

For example, the best publicly-available model (i.e. excluding Claude Mythos and Fable) on DeepSWE[0] is gpt-5.5-xhigh at 67%, which is soundly better than claude-opus-4.8-max at 59%. I would say an 8pp gap on a benchmark is quite large. But on FrontierCode[1], claude-opus-4.8-xhigh is the best, at a score of 13.4% compared to gpt-5.5-medium at 6.3%.

That's quite a significant reversal!

Now, one might wish to claim that either DeepSWE or FrontierCode is poorly constructed and that the other is more accurate. But I think you'll find that eval-design considerations affect measurement in this case to a degree that is probably no smaller than the degree to which user-specific considerations affect measurement in general.

[0] https://deepswe.datacurve.ai/ [1] https://cognition.com/blog/frontier-code
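To make the reversal above concrete, here is a small sketch that uses only the four scores quoted in this comment. The `family` and `leader` helpers are hypothetical names for illustration, and the reasoning-effort settings (xhigh, max, medium) differ between the two leaderboards, so this compares model families rather than identical configurations.

```python
# Scores (in percent) exactly as quoted in the comment above.
scores = {
    "DeepSWE": {"gpt-5.5-xhigh": 67.0, "claude-opus-4.8-max": 59.0},
    "FrontierCode": {"claude-opus-4.8-xhigh": 13.4, "gpt-5.5-medium": 6.3},
}


def family(model_name: str) -> str:
    """Hypothetical helper: reduce 'gpt-5.5-xhigh' to its family, 'gpt'."""
    return model_name.split("-")[0]


def leader(board: dict) -> str:
    """Return the family of the top-scoring model on one leaderboard."""
    best = max(board, key=board.get)
    return family(best)


# Which family leads each benchmark, and by how many percentage points.
leaders = {bench: leader(board) for bench, board in scores.items()}
gaps = {}
for bench, board in scores.items():
    top, runner_up = sorted(board.values(), reverse=True)[:2]
    gaps[bench] = round(top - runner_up, 1)

for bench in scores:
    print(f"{bench}: {leaders[bench]} leads by {gaps[bench]}pp")
```

The leader flips between the two benchmarks even though both are ostensibly coding evals, which is the point: "best" depends on which eval you ask.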

unchar1 2 hours ago | parent | prev | next [-]

It's not just figuring out if a model is good at things, but is it good at the things I care about.

Using a targeted eval suite (like a test suite) tells us that.

moomin 3 hours ago | parent | prev [-]

It's not just for choice of model, you can use it for your prompting as well (basically anything to do with your setup). And yes, running evals is expensive and mostly of use to people with serious spend.

lupire an hour ago | parent | prev [-]

But frontier models are constantly changing.

thomasliao 4 hours ago | parent | prev | next [-]

(Author) It's short for "evaluation", a test for an AI model. Specifically, an AI evaluation comprises (1) a dataset of prompts (as questions / tasks / queries), (2) some way to score model performance on each prompt, like a set of correct answers or a grading rubric that you can use with an LLM autograder, and (3) a metric, such as accuracy¹. (If you're already familiar with the term "benchmark", it's the same thing; for some reason "eval" has become the term of art in the past few years.)

For example, a simple eval is a dataset of multiple-choice questions, each with one correct answer, scored by accuracy. An example of this kind of eval is the Massive Multitask Language Understanding benchmark (2020) (https://arxiv.org/abs/2009.03300).
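As a minimal sketch of those three components (dataset, scorer, metric) for the multiple-choice case: everything here, including `Item`, `exact_match`, `run_eval`, and the toy `always_b` "model", is a hypothetical name made up for illustration, not taken from MMLU or any eval library. A real eval would swap `always_b` for a call to an actual model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    prompt: str         # the question shown to the model
    choices: List[str]  # answer options, labeled A, B, C, D in order
    answer: str         # label of the single correct choice


def exact_match(prediction: str, item: Item) -> float:
    """Scorer: 1.0 if the predicted label matches the correct label."""
    return 1.0 if prediction.strip().upper() == item.answer else 0.0


def run_eval(
    model: Callable[[Item], str],
    dataset: List[Item],
    scorer: Callable[[str, Item], float],
) -> float:
    """Metric: accuracy, i.e. the mean per-item score over the dataset."""
    scores = [scorer(model(item), item) for item in dataset]
    return sum(scores) / len(scores)


# (1) Dataset: a toy stand-in for a real multiple-choice benchmark.
DATASET = [
    Item("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
    Item("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], "A"),
    Item("H2O is?", ["Salt", "Water", "Gold", "Air"], "B"),
    Item("Largest planet?", ["Mars", "Venus", "Earth", "Jupiter"], "D"),
]


def always_b(item: Item) -> str:
    """Placeholder model that ignores the prompt and always answers B."""
    return "B"


accuracy = run_eval(always_b, DATASET, exact_match)
print(f"accuracy = {accuracy:.2f}")
```

Keeping the scorer separate from the metric is the part that carries over to harder evals: you can replace `exact_match` with a rubric or an LLM autograder without touching the rest.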

A more complex eval is FrontierCode (2026) (https://cognition.com/blog/frontier-code). Questions in FrontierCode represent coding tasks needed for real-world repos and are evaluated against rubrics that score correctness, code quality, cleanliness, and other factors.

¹Note that this is slightly different from the definition we used in [0], which defined an eval as a fixed set of input-output pairs combined with a metric. What's different from 2021 is: models are now given more open-ended inputs (prompts like "find the bug" and a codebase rather than a simple question), have freeform generation (rather than choosing a fixed answer), and are graded in a more complex manner (e.g. beyond correctness, one might also want a coding eval to grade adherence to coding guidelines, test coverage, etc).

[0] Liao, T., Taori, R., Raji, I. D., & Schmidt, L. (2021, January). Are we learning yet? A meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://thomasliao.com/are_we_learning_yet.pdf

jorisw 3 hours ago | parent | next [-]

I would've started the article by alluding to this, or added a tooltip or something to that effect.

jillesvangurp 42 minutes ago | parent | prev [-]

That sounds a bit weak as a startup idea. Hard to productize, hard to scale, etc. It sounds more like consulting.

diegof79 an hour ago | parent | prev | next [-]

To complement the excellent answers that I read in this thread: an eval is a test.

What makes it particular for the case of AI is:

- there are many situations where you can’t test using pattern matching

- you want to test not only for correct answers but for voice and tone too (imagine a bank support LLM-based chatbot that answers using slang)

- evals can be used to compare the performance of different implementations; given the costs of LLMs, it’s very important

- running evals is more expensive than running standard tests, because you pay for the LLM calls under test and, often, for an LLM acting as a judge. This means that running them on every commit in your CI/CD pipeline is very expensive

- Knowing all the possible inputs for the LLM is impossible, so evals can also be run on runtime samples to detect anomalies
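A rough sketch of the tone and runtime-sampling points above, under loud assumptions: `stub_tone_judge` is a keyword-based stand-in for an LLM judge so the example runs offline, and `SLANG`, `sample_and_flag`, and the log lines are all made up for illustration. In practice the judge would be a call to a model with a grading rubric.

```python
import random
from typing import Callable, List

# Hypothetical word list; a real judge would be an LLM with a rubric.
SLANG = {"yo", "gonna", "bruh", "lol"}


def stub_tone_judge(response: str) -> bool:
    """Stand-in for an LLM judge: True if the tone looks acceptable."""
    words = {w.strip(".,!?").lower() for w in response.split()}
    return not (words & SLANG)


def sample_and_flag(
    logs: List[str],
    judge: Callable[[str], bool],
    k: int,
    seed: int = 0,
) -> List[str]:
    """Judge a random sample of k logged responses; return the failures."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    sample = rng.sample(logs, min(k, len(logs)))
    return [response for response in sample if not judge(response)]


# Hypothetical responses logged from a bank support chatbot.
logs = [
    "Your balance is available in the app.",
    "Yo, your card is gonna arrive soon lol.",
    "We have reset your password.",
]

flagged = sample_and_flag(logs, stub_tone_judge, k=3)
print(flagged)
```

Sampling only `k` responses rather than judging every one is what keeps the cost bounded when the judge is itself an LLM call.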

rockyj 2 hours ago | parent | prev [-]

IMHO - In an AI context an "eval" is answering the question - "Is this AI / LLM call helping me, and is it doing the right thing?"

AI is not deterministic like regular code, so imagine you use it for "search" (RAG) or for summarizing or for classifying emails etc. How do you know it is giving you the right results? In this context, AI evals are an important idea and very often neglected.

You can use an initial "dataset" to evaluate your prompt and AI calls + code (think test cases); this dataset will of course be curated by humans. But as the software is used, you want to incorporate real production data as well and run the evaluation pre and post launch. Sounds simple, but it can get complicated, especially since this area is new and, as the post mentioned, there are too many players and options out there (since everyone thought this was a money maker).
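A minimal sketch of that workflow, using email classification as the example. All names and data here are hypothetical: `old_classifier` and `new_classifier` are keyword rules standing in for two versions of a prompt or LLM call, and the "production" cases stand in for logged inputs that a human has labeled.

```python
from typing import Callable, Dict, List

Case = Dict[str, str]

# Human-curated cases written before launch (hypothetical).
curated: List[Case] = [
    {"input": "Win a free phone", "expected": "spam"},
    {"input": "Meeting at 3pm", "expected": "ham"},
]

# Cases collected from real usage after launch and labeled (hypothetical).
production: List[Case] = [
    {"input": "Claim your prize now", "expected": "spam"},
    {"input": "Winter schedule attached", "expected": "ham"},
]


def old_classifier(text: str) -> str:
    """Stand-in for the currently deployed prompt / LLM call."""
    return "spam" if "win" in text.lower() else "ham"


def new_classifier(text: str) -> str:
    """Stand-in for the candidate change being evaluated."""
    lowered = text.lower()
    return "spam" if any(w in lowered for w in ("win", "prize")) else "ham"


def pass_rate(system: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases where the system output matches the label."""
    hits = sum(system(c["input"]) == c["expected"] for c in cases)
    return hits / len(cases)


all_cases = curated + production
pre_launch = pass_rate(old_classifier, curated)  # looks perfect
before = pass_rate(old_classifier, all_cases)    # production data disagrees
after = pass_rate(new_classifier, all_cases)
ship = after >= before                           # simple regression gate

print(f"curated only: {pre_launch:.2f}, before: {before:.2f}, after: {after:.2f}")
```

The curated set alone gives the old version a perfect score; the failures only show up once production cases are added, which is the argument for re-running the eval after launch.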