The paper does a good job explaining why this is mathematically not possible unless the question-answer bank is a fixed set.

smallmancontrov 2 days ago | parent [-]

Quite the opposite: it explains that it is mathematically straightforward to achieve better alignment on uncertainty ("calibration") but that leaderboards penalize it.

> This “epidemic” of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards

Even more embarrassing, it looks like this is something we beat into models rather than something we can't beat out of them:

> empirical studies (Fig. 2) show that base models are often found to be calibrated, in contrast to post-trained models

That said, I generally appreciate a fairly strong bias-to-action, and I find the fact that it got slightly overcooked less offensive than the alternative: an undercooked bias-to-action where the model studiously avoids doing anything useful in favor of "it depends" plus three plausible reasons why.
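
To make the leaderboard point concrete, here is a rough Python sketch of why right/wrong grading rewards guessing. The function names and the `t / (1 - t)` penalty rule are illustrative assumptions on my part, not code from the paper or from any benchmark. The idea: if a wrong answer costs nothing, answering beats "I don't know" at any nonzero confidence, whereas a penalty for wrong answers makes abstaining the better move below some confidence threshold `t`.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering when the answer is right with probability p_correct.

    Assumed scoring: correct = +1, wrong = -wrong_penalty, abstain ("I don't know") = 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float) -> str:
    """Return the score-maximizing action under the assumed scoring."""
    return "answer" if expected_score(p_correct, wrong_penalty) > 0.0 else "abstain"


def penalty_for_threshold(t: float) -> float:
    """Penalty at which answering breaks even exactly at confidence t (hypothetical rule)."""
    return t / (1.0 - t)


if __name__ == "__main__":
    low_confidence = 0.1
    # Plain right/wrong grading: wrong answers cost nothing, so guessing always wins.
    print(best_action(low_confidence, wrong_penalty=0.0))
    # Penalized grading with a 0.75 confidence threshold: the same guess now loses.
    print(best_action(low_confidence, wrong_penalty=penalty_for_threshold(0.75)))
```

Under the first scheme a model that always guesses can only gain on the leaderboard, which is the incentive the quoted passage is complaining about.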

baq 2 days ago | parent [-]

> leaderboards penalize it

> socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards

Sounds more like we need new leaderboards, and the old ones should be deprecated.

smallmancontrov 2 days ago | parent [-]

Yeah, it's a big enough lift that I think it's fair to allow the leaderboard teams new announcements and buzzwords in exchange for doing the work :-)