Remix clone Hacker News


	▲	apwheele 5 hours ago
		Totally agree it is critical. Chapters 4, 5, and 6 each have a specific section demonstrating testing. For structured outputs, it walks through an example ground truth and calculates accuracy, with a demo comparing Haiku 3 vs 4.5. For Chapter 5 on RAG, it covers precision/recall (with the emphasis typically on recall for RAG systems). For Chapter 6, I show a demo of LLM-as-a-judge (using structured outputs to define the specific errors it looks for) to evaluate a fuzzier objective: writing a report based on table output.
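		To make the first two checks concrete, here is a minimal sketch of field-level accuracy against a ground truth and of precision/recall for retrieval. It is not taken from the book; the function names, field names, and document IDs are all hypothetical and only illustrate the arithmetic.

```python
def accuracy(predicted, truth):
    """Share of ground-truth fields that the structured output got exactly right."""
    if not truth:
        raise ValueError("ground truth is empty")
    correct = sum(1 for key, value in truth.items() if predicted.get(key) == value)
    return correct / len(truth)


def precision_recall(retrieved, relevant):
    """Precision and recall of retrieved document IDs against the relevant set."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical structured-output example: one of three fields is wrong.
truth = {"name": "Ada", "year": 1843, "topic": "engines"}
predicted = {"name": "Ada", "year": 1842, "topic": "engines"}
print(accuracy(predicted, truth))

# Hypothetical retrieval example: 2 of 4 retrieved docs are relevant,
# and 2 of the 3 relevant docs were found.
retrieved = ["doc1", "doc4", "doc7", "doc9"]
relevant = ["doc1", "doc7", "doc8"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

		Recall is the number to watch when a missed document means the model never sees the answer; extra retrieved documents cost precision, but the model can often ignore them.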