theteapot 4 hours ago
What's an eval?
choult 4 hours ago
Evaluations of different implementations of a technology. Kind of like a meta service layer on top of an industry, answering questions such as "Which frontier model is best?" I do agree that the author does not do a good job of introducing the term.
thomasliao 4 hours ago
(Author) It's short for "evaluation", a test for an AI model. Specifically, an AI evaluation comprises (1) a dataset of prompts (questions, tasks, or queries), (2) some way to score model performance on each prompt, such as a set of correct answers or a grading rubric that you can use with an LLM autograder, and (3) a metric, such as accuracy¹. (If you're already familiar with the term "benchmark", it's the same thing; for some reason "eval" has become the term of art in the past few years.)

For example, a simple eval is a dataset of multiple-choice questions, each with one correct answer, scored by accuracy. An example of this kind of eval is the Massive Multitask Language Understanding benchmark (2020) (https://arxiv.org/abs/2009.03300).

A more complex eval is FrontierCode (2026) (https://cognition.com/blog/frontier-code). Questions in FrontierCode represent coding tasks needed for real-world repos and are evaluated against rubrics that score correctness, code quality, cleanliness, and other factors.

¹ Note that this is slightly different from the definition we used in [0], which was a fixed set of input-output pairs combined with a metric. What's different from 2021: models are now given more open-ended inputs (prompts like "find the bug" plus a codebase, rather than a simple question), generate freeform output (rather than choosing a fixed answer), and are graded in a more complex manner (e.g. beyond correctness, a coding eval might also grade adherence to coding guidelines, test coverage, etc.).

[0] Liao, T., Taori, R., Raji, I. D., & Schmidt, L. (2021, January). Are we learning yet? A meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://thomasliao.com/are_we_learning_yet.pdf
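To make the three parts concrete, here is a minimal sketch of the simple multiple-choice case described above: a dataset, a per-prompt scorer, and accuracy as the metric. The names `Item`, `score`, `accuracy`, and `always_b` are hypothetical, the four questions are invented for illustration, and the "model" is a stand-in function rather than a real LLM call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    prompt: str     # (1) the question shown to the model
    choices: tuple  # answer options, e.g. ("A) 3", "B) 4")
    answer: str     # (2) label of the single correct choice


def score(item: Item, prediction: str) -> int:
    """Per-prompt score: 1 if the predicted label matches the answer, else 0."""
    return int(prediction.strip().upper() == item.answer)


def accuracy(items, model) -> float:
    """(3) The metric: mean per-prompt score over the whole dataset."""
    if not items:
        raise ValueError("empty dataset")
    return sum(score(it, model(it)) for it in items) / len(items)


# Toy dataset, invented for illustration only.
DATASET = [
    Item("2 + 2 = ?", ("A) 3", "B) 4"), "B"),
    Item("Capital of France?", ("A) Paris", "B) Rome"), "A"),
    Item("Largest planet?", ("A) Mars", "B) Jupiter"), "B"),
    Item("H2O is?", ("A) Water", "B) Salt"), "A"),
]


def always_b(item: Item) -> str:
    """Stand-in 'model' that always answers B; a real one would call an LLM."""
    return "b"


print(accuracy(DATASET, always_b))  # 0.5: two of the four answers are B
```

Swapping `always_b` for a function that calls an actual model leaves the dataset, scorer, and metric unchanged, which is what makes scores comparable across models.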
diegof79 an hour ago
To complement the excellent answers in this thread: an eval is a test. What makes it particular in the case of AI is:
- There are many situations where you can’t test using pattern matching.
- You want to test not only for correct answers but for voice and tone too (imagine an LLM-based bank support chatbot that answers using slang).
- Evals can be used to compare the performance of different implementations; given the cost of LLMs, that’s very important.
- Running evals is more expensive than running standard tests, because you rely on the LLM calls under test, and often an LLM is used as the judge as well. That makes running them on every commit in your CI/CD very expensive.
- Knowing all the possible inputs for the LLM is impossible, so evals can also be run on runtime samples to detect anomalies.
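The tone and LLM-as-judge points above can be sketched as follows. This is an illustration under stated assumptions, not anyone's actual setup: `RUBRIC`, `Judge`, `stub_judge`, and `run_eval` are hypothetical names, and `stub_judge` is a deterministic keyword check standing in for a real LLM judge call so the example runs offline.

```python
from typing import Callable

RUBRIC = "Formal tone, no slang, answers the customer's question."

# Assumed judge signature: (rubric, reply) -> score in [0, 1].
Judge = Callable[[str, str], float]


def stub_judge(rubric: str, reply: str) -> float:
    """Placeholder for an LLM judge: fails any reply containing a slang word."""
    slang = {"yo", "gonna", "nah", "dude"}
    words = {w.strip(".,!?").lower() for w in reply.split()}
    return 0.0 if words & slang else 1.0


def run_eval(replies: list[str], judge: Judge, threshold: float = 0.5) -> dict:
    """Score each reply with the judge; report pass rate and judge-call count."""
    scores = [judge(RUBRIC, r) for r in replies]
    passed = sum(s >= threshold for s in scores)
    return {"pass_rate": passed / len(replies), "judge_calls": len(scores)}


replies = [
    "Your transfer will arrive within two business days.",
    "Yo, your cash is gonna show up soon.",
]
print(run_eval(replies, stub_judge))  # {'pass_rate': 0.5, 'judge_calls': 2}
```

The `judge_calls` count hints at the cost point: with a real judge, each scored reply is one more model call on top of the call that produced the reply.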
rockyj 2 hours ago
IMHO, in an AI context an "eval" answers the question: "Is this AI / LLM call helping me, and is it doing the right thing?" AI is not deterministic like regular code, so imagine you use it for "search" (RAG), for summarizing, for classifying emails, etc. How do you know it is giving you the right results? In this context, AI evals are an important idea and very often neglected. You can use an initial "dataset" to evaluate your prompt and AI calls + code (think test cases); this dataset will of course be curated by humans. But as the software is used, you want to incorporate real production data as well and run the evaluation both pre- and post-launch. Sounds simple, but it can get complicated, especially since this area is new and, as the post mentioned, there are too many players and options out there (since everyone thought this was a money maker).
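A rough sketch of the workflow described above, using the email-classification case: start from a human-curated dataset, then fold in a reproducible sample of production records and rerun the same check. All names (`build_eval_set`, `pass_rate`, `classify`) and all records are hypothetical, the production labels are assumed to have been added by a human reviewer, and `classify` is a keyword rule standing in for the LLM call under test.

```python
import random


def build_eval_set(curated, production_log, k, seed=0):
    """Curated cases plus a reproducible random sample of k production records."""
    rng = random.Random(seed)
    sample = rng.sample(production_log, min(k, len(production_log)))
    return list(curated) + sample


def pass_rate(cases, classify):
    """Fraction of cases where the predicted label equals the expected label."""
    return sum(classify(c["text"]) == c["label"] for c in cases) / len(cases)


def classify(text):
    """Stand-in for the LLM call under test: a simple keyword rule."""
    return "billing" if "invoice" in text.lower() else "support"


# Invented examples; production labels assumed to be human-reviewed.
curated = [
    {"text": "Invoice attached for March", "label": "billing"},
    {"text": "Cannot log in to my account", "label": "support"},
]
production_log = [
    {"text": "Refund my last invoice", "label": "billing"},
    {"text": "App crashes on login", "label": "support"},
    {"text": "Where is my invoice?", "label": "billing"},
]

pre_launch = pass_rate(curated, classify)
post_launch = pass_rate(build_eval_set(curated, production_log, k=2), classify)
print(pre_launch, post_launch)
```

Fixing the seed keeps the sampled production subset stable between runs, so a change in pass rate reflects a change in the system rather than in the sample.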