ACCount37 5 days ago

This mostly just restates what was already well known in the industry.

Still quite useful, because, looking at the comments right now: holy shit is the "out of industry knowledge" on the topic bad! Good to have something to bring people up to speed!

Good to see OpenAI's call for better performance evals - ones that penalize being confidently incorrect at least somewhat.

Most current evals are "all or nothing", and the incentive structure favors LLMs that straight up guess. Future evals had better include an "I don't know" opt-out, and a penalty for being wrong. If you want to evaluate accuracy in "fuck it send it full guess mode", there might be a separate testing regime for that, but it should NOT be the accepted default.
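To make the incentive argument concrete, here's a rough sketch of what such a scoring rule could look like. The function names and the +1 / 0 / -penalty values are my own assumptions for illustration, not anything taken from OpenAI or an existing benchmark.

```python
from typing import Optional


def score_answer(answer: Optional[str], truth: str, wrong_penalty: float = 1.0) -> float:
    """Hypothetical rule: +1 if correct, 0 for "I don't know", -wrong_penalty if wrong."""
    if answer is None:  # None stands in for the "I don't know" opt-out
        return 0.0
    return 1.0 if answer == truth else -wrong_penalty


def expected_guess_score(p_correct: float, wrong_penalty: float = 1.0) -> float:
    """Expected score of answering when the model is right with probability p_correct."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence above which guessing beats abstaining (abstaining scores 0)."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # All-or-nothing grading is the special case wrong_penalty = 0:
    # a 25%-confident guess still has positive expected score, so guessing always pays.
    print(expected_guess_score(0.25, wrong_penalty=0.0))
    # With a penalty of 1, the same guess has negative expected score,
    # so opting out (score 0) is the better move.
    print(expected_guess_score(0.25, wrong_penalty=1.0))
    print(break_even_confidence(1.0))
```

With a penalty of 0 the break-even confidence is 0, so a model never loses by guessing. Any positive penalty moves that threshold up, and the size of the penalty is a knob the eval designer would have to pick.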
