| ▲ | rockyj 2 hours ago | |
IMHO - In an AI context an "eval" is answering the question - "Is this AI / LLM call helping me or is doing the right thing?" AI is not deterministic like regular code, so imagine you use it for "search" (RAG) or for summarizing or for classifying emails etc. How do you know it is giving you the right results? In this context, AI evals are an important idea and very often neglected. You can use an initial "dataset" to evaluate your prompt and AI calls + code (think test cases), this dataset will of-course be curated by humans. But as the software is used, you want to incorporate, real production data as well and run the evaluation pre and post launch. Sounds simple, but can get complicated specially since this area is new and as the post mentioned there are too many players and options out there (since everyone thought this is a money maker). | ||