abhgh | 4 days ago
Evals somehow seem to be very underrated, which is concerning in a world where we are moving towards (or trying to move towards) systems with more autonomy. Your skepticism of "llm-as-a-judge" setups is spot on. If your LLM can make mistakes/hallucinate, then of course your judge LLM can too. In practice, you need to validate your judges and possibly adapt them to your task based on sample annotated data. You might adapt them by trial and error, by prompt optimization, e.g., using DSPy [1], or by learning a small correction model on top of their outputs, e.g., LLM-Rubric [2] or Prediction Powered Inference [3]. In the end, using the LLM as a judge confers just these benefits:

1. It is easy to express complex evaluation criteria. This does not guarantee correctness.

2. Seen as a model, it is easy to "train", i.e., you get all the benefits of in-context learning, e.g., prompt-based few-shot adaptation.

But you still need to evaluate and adapt them (a minimal sketch of what that validation step can look like is below). I have notes from a NeurIPS workshop from last year [4]. Btw, love your username!

[2] https://aclanthology.org/2024.acl-long.745/
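To make the "validate and correct your judge" point concrete, here is a minimal Python sketch. The scores, labels, and threshold are made up for illustration; it is only in the spirit of the correction-model idea (LLM-Rubric / PPI), not their actual methods. It checks how well a judge's raw scores agree with a small human-annotated sample, then fits a tiny correction model mapping judge scores to human labels.

```python
# Minimal sketch (hypothetical data): validate an LLM judge against a small
# human-annotated sample, then fit a simple correction model on its scores.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score

# judge_scores: raw 1-5 quality scores from the judge LLM (collected upstream)
# human_labels: binary "acceptable" labels from annotators on the same items
judge_scores = [[5], [2], [4], [1], [3], [5], [2], [4]]
human_labels = [1, 0, 1, 0, 1, 1, 0, 0]

# 1. Validate: how well does a naive threshold on the judge agree with humans?
naive_preds = [1 if s[0] >= 4 else 0 for s in judge_scores]
print("naive accuracy:", accuracy_score(human_labels, naive_preds))
print("naive kappa:   ", cohen_kappa_score(human_labels, naive_preds))

# 2. Correct: learn a small model mapping judge scores -> human labels.
#    With more annotated data you would of course hold out a test split.
corrector = LogisticRegression().fit(judge_scores, human_labels)
calibrated_preds = corrector.predict(judge_scores)
print("calibrated accuracy:", accuracy_score(human_labels, calibrated_preds))
```

The point is just that the judge is itself a model whose outputs you measure against ground truth and, if needed, recalibrate, rather than something you trust as-is.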