purple-leafy 4 hours ago

Benchmarks are great, but I feel like there’s a better way; this seems quite subjective.

What you really need is an objective benchmark

eli 4 hours ago

I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.

charcircuit 3 hours ago

The issue is that you can't do unsupervised learning if you require humans.

rhdunn 25 minutes ago

LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).
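As a rough sketch of how a grading harness might flag those two failure modes before scoring anything: the function names, the refusal phrases, and the thresholds below are all hypothetical choices for illustration, not taken from any particular tool.

```python
from collections import Counter

# Hypothetical refusal phrases; a real list would need tuning per model.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


def looks_like_refusal(text: str) -> bool:
    """Heuristic: a refusal phrase appears near the start of the response."""
    head = text.lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def looks_stuck(text: str, n: int = 5, max_repeats: int = 3) -> bool:
    """Heuristic: some n-word sequence occurs more than max_repeats times."""
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count > max_repeats for count in ngrams.values())
```

Responses flagged by either check could be scored as failures or set aside for manual review rather than being passed to a grader.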

I'm experimenting with using traditional NLP (stanza, spaCy, etc.) to grade the responses according to different metrics. (Is the response in first, second, or third person? Is it written as poetry, prose, or drama? etc.) I'm also thinking about using information extraction and synonym detection to handle data queries and the like.
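To give a feel for the first/second/third person metric, here is a minimal sketch that only counts pronouns. It is a deliberate simplification: it uses the standard library instead of stanza or spaCy, and the pronoun lists and the `grammatical_person` name are assumptions made for illustration. A parser-based version would read person from morphological features instead.

```python
import re
from collections import Counter

# Hypothetical pronoun lists; not exhaustive.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "myself",
              "we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third": {"he", "him", "his", "himself", "she", "her", "hers", "herself",
              "it", "its", "itself",
              "they", "them", "their", "theirs", "themselves"},
}


def grammatical_person(text: str) -> str:
    """Return the person whose pronouns occur most often, or 'unknown'.

    Ties go to whichever person appears first in the text.
    """
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for person, words in PRONOUNS.items():
            if token in words:
                counts[person] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]
```

A response with mixed narration and dialogue will confuse a counter like this, which is one reason to move to a real parser for anything beyond a quick check.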

charcircuit 5 minutes ago

>LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).

And LLMs have gotten good at handling these issues. There is an asymmetry in difficulty between generating a solution and verifying that it is correct. And over time LLMs keep getting better, which allows training on synthetic data to improve them further.

echelon 4 hours ago

> What you really need is an objective benchmark

"When are all the software engineers unemployed?"

purple-leafy 3 hours ago

Not sure I follow haha