shahbaby | 7 months ago
Fully agree. I've found that LLMs aren't good at tasks that require evaluation. Think about it: if they were good at evaluation, you could remove all humans from the loop and have recursively self-improving AGI. Nice to see an article that makes a more concrete case.
visarga | 7 months ago
Humans aren't good at validation either. We need tools, experiments, labs. Unproven ideas are a dime a dozen. Remember the hoopla about room-temperature superconductivity? The real source of validation is external consequences.
NitpickLawyer | 7 months ago
I think there's more nuance here; the way I read the article is "beware of these shortcomings" rather than "LLMs aren't good at evaluation." LLM-based evaluation can be good. Several models have by now been trained with previous-gen models doing the data filtering and validating RLHF preference data (pairwise comparisons or more advanced schemes); Llama 3 is a good example of this.

My takeaway from the article is that there are plenty of gotchas along the way: you need to be careful about how you structure your data, how you test your pipelines, and how you make sure your tests keep up with new models. But, like it or not, LLM-based evaluation is here to stay, so explorations into this space are good, IMO.
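
As a concrete illustration of one such gotcha (position bias in pairwise judging), here's a minimal sketch of a pairwise LLM judge that asks for a verdict in both orderings and only trusts the result when they agree. This is not the article's pipeline; `call_llm` is a hypothetical stand-in for whatever completion API you actually use, and the prompt is just an example.

    # Minimal sketch of a pairwise LLM-as-judge check with position-bias
    # mitigation. `call_llm(prompt: str) -> str` is a hypothetical helper
    # standing in for your actual model API.

    JUDGE_PROMPT = (
        "You are grading two candidate answers to the same question.\n"
        "Question: {question}\n"
        "Answer A: {a}\n"
        "Answer B: {b}\n"
        "Reply with exactly one letter, A or B, for the better answer."
    )

    def judge_once(call_llm, question, a, b):
        # Ask the judge model for a single A/B verdict.
        reply = call_llm(JUDGE_PROMPT.format(question=question, a=a, b=b))
        verdict = reply.strip().upper()[:1]
        return verdict if verdict in ("A", "B") else None

    def judge_pair(call_llm, question, ans1, ans2):
        """Run the judge in both orderings; only trust it when they agree."""
        first = judge_once(call_llm, question, ans1, ans2)   # ans1 shown as A
        second = judge_once(call_llm, question, ans2, ans1)  # ans1 shown as B
        if first == "A" and second == "B":
            return "ans1"
        if first == "B" and second == "A":
            return "ans2"
        return "inconsistent"  # likely position bias or judge noise; flag for review

Tracking the rate of "inconsistent" verdicts over a fixed set of pairs is also a cheap regression test when you swap in a new judge model, which is one way to keep the tests "keeping up with new models."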