cranberryturkey · 5 hours ago
Biggest gap I see in most "LLM for practitioners" guides is that they skip the evaluation piece. Getting a prompt working on 5 examples is easy — knowing if it actually generalizes across your domain is the hard part. Especially for analysts who are used to statistical rigor, the vibes-based evaluation most LLM tutorials teach feels deeply unsatisfying. Does this guide cover systematic eval at all?
apwheele · 5 hours ago
Totally agree it is critical. Each of chapters 4/5/6 has a specific section demonstrating testing. For structured outputs, it walks through an example ground-truth set and calculates accuracy, with a demo comparing Haiku 3 vs 4.5. Chapter 5, on RAG, goes through precision/recall (with the emphasis typically on recall for RAG systems). In Chapter 6, I demo LLM-as-a-judge (using structured outputs to specify the errors it looks for) to evaluate a fuzzier objective (writing a report based on table output).
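To make the first two of those evals concrete, here is a minimal sketch of the metrics involved: field-level accuracy against ground truth for structured outputs, and set-based precision/recall for a retriever. All the data and field names here are hypothetical placeholders; in a real eval the `predictions` would come from model calls over your labeled examples.

```python
# Sketch of two eval loops: accuracy for structured outputs,
# precision/recall for RAG retrieval. Data below is hypothetical.

def field_accuracy(ground_truth, predictions, field):
    """Share of examples where the model's value for `field` matches."""
    correct = sum(
        1 for gt, pred in zip(ground_truth, predictions)
        if gt[field] == pred.get(field)
    )
    return correct / len(ground_truth)

def precision_recall(retrieved, relevant):
    """Set-based precision/recall for one query's retrieved doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical labeled examples for a structured-output eval:
gt = [{"charge": "burglary"}, {"charge": "assault"}]
pred = [{"charge": "burglary"}, {"charge": "theft"}]
print(field_accuracy(gt, pred, "charge"))  # 0.5

# Hypothetical RAG eval: retriever returned docs 1,2,3; docs 2,4 relevant.
# Recall (0.5) is the number the comment above says usually matters most.
print(precision_recall([1, 2, 3], [2, 4]))
```

The same `field_accuracy` loop run once per model is what a Haiku 3 vs 4.5 comparison amounts to: same ground truth, two prediction sets, two accuracy numbers.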