alganet a day ago
That doesn't make any sense.
echoangle a day ago
Why not? If the model learns the specific benchmark questions, it looks like it's doing better while actually only improving on those specific questions. Just like students look like they understand something if you hand them the exact questions before they write the exam.
| ||||||||
esafak a day ago
Yes, it does, unless the questions are unsolved research problems. Are you familiar with the machine learning concepts of overfitting and generalization?
kube-system a day ago
A benchmark is a proxy used to estimate broader general performance. It only has utility if it accurately represents that general performance.
readhistory a day ago
In ML, it's pretty classic actually. You train on one set, and evaluate on another set. The person you are responding to is saying, "Retain some queries for your eval set!" | ||||||||
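A minimal sketch of what "retain some queries for your eval set" means in practice, using a toy list of question/answer pairs (all names here are illustrative, not from any particular library):

```python
import random

def split_dataset(pairs, eval_fraction=0.2, seed=42):
    """Shuffle and hold out a fraction of pairs for evaluation only."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    # Held-out eval set: never shown during training.
    eval_set = shuffled[:n_eval]
    train_set = shuffled[n_eval:]
    return train_set, eval_set

pairs = [(f"q{i}", f"a{i}") for i in range(10)]
train_set, eval_set = split_dataset(pairs)
```

Because the eval questions never appear in training, scoring on them estimates generalization rather than memorization.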
jjeaff 17 hours ago
I think the worry is that the questions will be scraped and trained on by future model versions.