NitpickLawyer 8 hours ago |
I think there's more nuance here; the way I read the article is more "beware of these shortcomings" than "these aren't good". LLM-based evaluation can be good. Several models have by now been trained with previous-gen models doing the data filtering and validating the RLHF data (pairwise comparisons or even more advanced setups); Llama 3 is a good example of this. My take from this article is that there are plenty of gotchas along the way: you need to be careful about how you structure your data, how you test your pipelines, and how you make sure your tests keep up with new models. But, like it or not, LLM-based evaluation is here to stay, so explorations into this space are good, IMO.
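
For anyone wondering what "pairwise" evaluation looks like in practice, here's a rough Python sketch. The call_llm helper is a placeholder for whatever judge-model client you'd use (my assumption, not anything from the article); the answer-order swap is one cheap way to catch the position-bias gotcha that write-ups like this tend to warn about:

    # Minimal sketch of pairwise LLM-as-judge evaluation with position swapping.
    # call_llm is a stand-in for your actual judge-model client (assumption).

    JUDGE_PROMPT = """You are grading two candidate answers to the same question.

    Question:
    {question}

    Answer A:
    {answer_a}

    Answer B:
    {answer_b}

    Reply with exactly one token: "A", "B", or "TIE"."""


    def call_llm(prompt: str) -> str:
        """Placeholder: send `prompt` to your judge model and return its text reply."""
        raise NotImplementedError


    def judge_once(question: str, answer_a: str, answer_b: str) -> str:
        reply = call_llm(JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b
        )).strip().upper()
        # Treat anything that isn't a clean verdict as a tie.
        return reply if reply in {"A", "B", "TIE"} else "TIE"


    def pairwise_judge(question: str, answer_1: str, answer_2: str) -> str:
        """Judge twice with the answers swapped to control for position bias."""
        first = judge_once(question, answer_1, answer_2)    # answer_1 shown as "A"
        second = judge_once(question, answer_2, answer_1)   # answer_1 shown as "B"
        second_flipped = {"A": "B", "B": "A", "TIE": "TIE"}[second]
        if first == second_flipped:
            return {"A": "answer_1", "B": "answer_2", "TIE": "tie"}[first]
        return "tie"  # the judge disagreed with itself, so don't trust the verdict

If the judge flips its verdict when you swap the answer order, that's exactly the kind of failure your pipeline tests should be catching before you trust the scores.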