apwheele 5 hours ago

Totally agree it is critical. Chapters 4, 5, and 6 each have a specific section demonstrating testing. For structured outputs, it goes through an example ground truth and calculating accuracy, with a demo comparing Haiku 3 vs 4.5.
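To make the ground-truth idea concrete, here is a minimal sketch of field-level accuracy for structured outputs. It is not the book's code: `field_accuracy`, the field names, and the `model_a`/`model_b` outputs are all made up for illustration and are not real Haiku results.

```python
def field_accuracy(predictions, ground_truth):
    """Fraction of (record, field) pairs where the prediction matches the label."""
    correct = 0
    total = 0
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total += 1
            # A missing field counts as wrong, since pred.get returns None.
            if pred.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0


# Hypothetical hand-labeled ground truth and two sets of model outputs.
truth = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 51}]
model_a = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 15}]
model_b = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 51}]

print(field_accuracy(model_a, truth))  # 0.75
print(field_accuracy(model_b, truth))  # 1.0
```

The same function can be run against outputs from any two models on the same labeled set, which is all a head-to-head comparison really needs.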

For Chapter 5 on RAG, it goes through precision/recall (with emphasis typically on recall for RAG systems).
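As a rough sketch of the retrieval metrics (again not the book's code; `precision_recall` and the document IDs are hypothetical), precision is the share of retrieved chunks that are relevant, and recall is the share of relevant chunks that were retrieved:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query's retrieved documents."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Made-up IDs: the retriever returned 4 documents, 3 are actually relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(precision)  # 0.5
print(recall)
```

Here recall is 2 out of 3, because doc7 never made it into the context. That miss is the one that hurts most in RAG, since the model cannot answer from a document it never saw.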

For Chapter 6, I show a demo of LLM as a judge (using structured outputs so the judge looks for specific errors) to evaluate a fuzzier objective (writing a report based on table output).
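A minimal sketch of the judge pattern, under heavy assumptions: the error categories, `parse_verdict`, `pass_rate`, and the JSON shape are all hypothetical, and `fake_judge` is a stand-in for a real LLM call so the example runs without any API.

```python
import json

# Hypothetical error categories the judge is allowed to report.
ERROR_TYPES = ("wrong_number", "unsupported_claim", "missing_key_finding")


def parse_verdict(raw_json):
    """Validate the judge's structured output and return its list of errors."""
    errors = json.loads(raw_json).get("errors", [])
    unknown = [e for e in errors if e not in ERROR_TYPES]
    if unknown:
        raise ValueError(f"judge returned unknown error types: {unknown}")
    return errors


def pass_rate(reports, judge):
    """Share of reports for which the judge found no errors."""
    if not reports:
        return 0.0
    passed = sum(1 for report in reports if not parse_verdict(judge(report)))
    return passed / len(reports)


def fake_judge(report):
    """Stand-in for an LLM call; flags any report that mentions 999."""
    errors = ["wrong_number"] if "999" in report else []
    return json.dumps({"errors": errors})


reports = ["Sales rose to 120 units.", "Sales rose to 999 units."]
print(pass_rate(reports, fake_judge))  # 0.5
```

Restricting the judge to a fixed list of error types is what makes the fuzzy objective countable: each report either has a named error or it does not.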