mbesto a day ago

How do you objectively tell whether a model "performs" better than another?

belval a day ago | parent [-]

Not the original commenter, but I work in the space. We have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.
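To make that concrete, here is a minimal sketch of one common way to score retrieval against gold evidence: recall@k averaged over queries. The metric choice, the function names (`recall_at_k`, `mean_recall_at_k`), and the toy query and document IDs are all assumptions for illustration, not a description of any particular team's pipeline.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold evidence IDs that appear in the top-k retrieved IDs."""
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & gold) / len(gold)


def mean_recall_at_k(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every query that has gold annotations."""
    scores = [recall_at_k(runs.get(q, []), g, k) for q, g in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical annotated dataset: query ID -> set of gold evidence IDs.
gold_evidence = {"q1": {"d1", "d2"}, "q2": {"d3"}}

# Hypothetical ranked outputs from two models on the same queries.
model_a = {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d3"]}
model_b = {"q1": ["d9", "d8", "d1"], "q2": ["d3"]}

score_a = mean_recall_at_k(model_a, gold_evidence, k=2)
score_b = mean_recall_at_k(model_b, gold_evidence, k=2)
print(f"model_a recall@2 = {score_a:.2f}, model_b recall@2 = {score_b:.2f}")
```

With the same annotated set and the same k, the two models get directly comparable numbers, which is the sense in which this kind of evaluation is quantitative.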

mbesto 18 hours ago | parent [-]

> but I work in the space

Ya, the original commenter likely does not work in the space - hence the ask.

> the evaluation of new models is actually very quantitative.

While you may be able to derive a % correct (and hence a quantitative score), these evaluations are by their nature very much not quantitative. Q&As on written subjects are subjective. Example benchmark: https://llm-stats.com/benchmarks/gpqa Even though there are techniques to reduce overfitting, it still isn't eliminated. So the result is still very much subjective.