cranberryturkey 5 hours ago

Biggest gap I see in most "LLM for practitioners" guides is they skip the evaluation piece. Getting a prompt working on 5 examples is easy — knowing if it actually generalizes across your domain is the hard part. Especially for analysts who are used to statistical rigor, the vibes-based evaluation most LLM tutorials teach feels deeply unsatisfying.

Does this guide cover systematic eval at all?

apwheele 5 hours ago | parent [-]

Totally agree it is critical. Chapters 4, 5, and 6 each have specific sections demonstrating testing. For structured outputs, it goes through an example with ground truth labels and calculates accuracy, with a demo comparing Haiku 3 vs 4.5.
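To give a rough idea of the ground truth accuracy check, here is a minimal sketch. It is not the code from the book; the labels and the `model_a_preds` / `model_b_preds` lists are made up for illustration and stand in for the parsed structured outputs of two models.

```python
def accuracy(predictions, ground_truth):
    """Share of predictions that exactly match the ground truth label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be the same length")
    if not ground_truth:
        raise ValueError("need at least one labeled example")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Made-up labels purely for illustration; these are not results from the book.
ground_truth = ["theft", "assault", "fraud", "theft", "burglary"]
model_a_preds = ["theft", "assault", "theft", "theft", "burglary"]  # hypothetical model A
model_b_preds = ["theft", "assault", "fraud", "theft", "burglary"]  # hypothetical model B

acc_a = accuracy(model_a_preds, ground_truth)
acc_b = accuracy(model_b_preds, ground_truth)
print(f"model A accuracy: {acc_a:.2f}")
print(f"model B accuracy: {acc_b:.2f}")
```

With only 5 rows the difference is noise, which is the original point: you want a labeled set large enough that a gap between models means something.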

For Chapter 5 on RAG, it goes through precision/recall (with emphasis typically on recall for RAG systems).
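As a rough sketch of what precision/recall looks like for a single retrieval query (the document IDs are invented, and this is not the book's code):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, treating results as sets of doc IDs."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: what share of the retrieved docs are actually relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: what share of the relevant docs made it into the retrieved set.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical IDs: what the retriever returned vs. what a human marked relevant.
retrieved = ["doc1", "doc4", "doc7", "doc9"]
relevant = ["doc1", "doc7", "doc8"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

The usual reason to weight recall more heavily in RAG is that a relevant document that never gets retrieved cannot be used by the model, while an extra irrelevant one can often be ignored.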

For Chapter 6, I show a demo of LLM-as-a-judge (using structured outputs so the judge looks for specific errors) to evaluate a fuzzier objective (writing a report based on table output).
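A minimal sketch of the judge idea, under some loud assumptions: the three error fields are hypothetical categories I picked for illustration, and `fake_judge` is a stand-in for a real LLM call that returns JSON in that shape. Swap in whatever client and schema you actually use.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    # Hypothetical error categories; a real rubric would be task specific.
    wrong_number: bool
    unsupported_claim: bool
    missing_key_finding: bool

    @property
    def passed(self):
        return not (self.wrong_number or self.unsupported_claim or self.missing_key_finding)


def parse_verdict(raw_json):
    """Parse the judge's JSON reply; raises KeyError if a field is missing."""
    data = json.loads(raw_json)
    return JudgeVerdict(
        wrong_number=bool(data["wrong_number"]),
        unsupported_claim=bool(data["unsupported_claim"]),
        missing_key_finding=bool(data["missing_key_finding"]),
    )


def fake_judge(table, report):
    """Stand-in for an LLM call (assumption): flags a report that omits the figure 12."""
    return json.dumps({
        "wrong_number": "12" not in report,
        "unsupported_claim": False,
        "missing_key_finding": False,
    })


def pass_rate(reports, table, judge):
    """Share of reports the judge finds no listed errors in."""
    verdicts = [parse_verdict(judge(table, report)) for report in reports]
    return sum(v.passed for v in verdicts) / len(verdicts)


# Made-up table and reports for illustration only.
table = "burglaries: 12"
reports = ["Burglaries totaled 12 this month.", "Burglaries totaled 21 this month."]
rate = pass_rate(reports, table, fake_judge)
print(f"pass rate: {rate:.2f}")
```

Forcing the judge into named boolean fields is what makes the fuzzy task countable: you can tally which error type shows up most, and spot check the judge itself against a handful of hand-graded reports.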