TrackerFF 12 hours ago
I see "panels of judges" mentioned once, but what is the weakness of this approach, other than needing more resources? In the worst case you end up with a multi-modal distribution where two opinions are tied, which seems increasingly unlikely as the panel size grows. It could perhaps happen when there are exactly two outcomes (yes/no), but I'd be surprised if such a panel landed on a perfectly uniform distribution in its judgments/opinions (50% yes, 50% no).
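As a rough check on the intuition above, here is a minimal Python sketch of how likely an exact tie is for a yes/no panel. It assumes the judges vote independently and each says YES with the same probability; both the function name `tie_probability` and that independence assumption are illustrative and not something stated in the thread.

```python
from math import comb


def tie_probability(n_judges: int, p_yes: float = 0.5) -> float:
    """Chance that a YES/NO panel splits exactly evenly.

    Assumes (for illustration only) that judges vote independently and
    each says YES with probability p_yes.
    """
    if n_judges % 2 == 1:
        return 0.0  # an odd-sized panel cannot split evenly
    k = n_judges // 2
    # Binomial probability of exactly k YES votes out of n_judges.
    return comb(n_judges, k) * p_yes**k * (1 - p_yes) ** k


for n in (2, 4, 10, 100):
    print(n, tie_probability(n))
```

Under these assumptions the tie probability shrinks as the panel grows, but only slowly, and an odd panel size rules out an exact tie altogether. Judges sampled from the same model are unlikely to be fully independent, so treat the numbers as a rough guide only.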
nimitkalra 12 hours ago
One method to get a better estimate is to extract the token log-probabilities of "YES" and "NO" from the final logits of the LLM and take a weighted sum [1] [2]. If the LLM is calibrated for your task, then on a genuinely borderline sample there should be a roughly 50% chance of sampling YES (1) and a roughly 50% chance of NO (0), yielding 0.5.

But generally you wouldn't use a binary outcome when you can have samples that are 50/50 pass/fail. It is better to use a discrete scale of 1..3 or 1..5 and specify exactly what makes a sample a 2/5 vs. a 4/5, for example.

You are correct to question the weaknesses of a panel. This class of methods depends on diversity through high-temperature sampling, which can lead to spurious YES/NO responses that don't generalize well and are effectively noise.

[1]: https://arxiv.org/abs/2303.16634
[2]: https://verdict.haizelabs.com/docs/concept/extractor/#token-...
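To illustrate the weighted sum over token probabilities described above, here is a minimal Python sketch. It assumes you already have the log-probabilities of the candidate answer tokens at the answer position (for example from an API that exposes top-k log-probabilities); the function name `expected_score` and the renormalisation over the scored tokens are illustrative choices, not necessarily what [1] or [2] implement.

```python
import math


def expected_score(logprobs: dict[str, float], values: dict[str, float]) -> float:
    """Probability-weighted score over the judge's candidate answer tokens.

    logprobs: token -> log-probability at the answer position (hypothetical input).
    values:   token -> numeric score, e.g. {"YES": 1.0, "NO": 0.0}.
    """
    # Keep only the tokens we have a score for, then renormalise so that
    # probability mass on unrelated tokens does not distort the result.
    probs = {t: math.exp(lp) for t, lp in logprobs.items() if t in values}
    total = sum(probs.values())
    if total == 0.0:
        raise ValueError("none of the scored tokens appear in logprobs")
    return sum(values[t] * p / total for t, p in probs.items())


# Binary judge on a borderline sample: equal mass on YES and NO.
binary = expected_score(
    {"YES": math.log(0.5), "NO": math.log(0.5)},
    {"YES": 1.0, "NO": 0.0},
)

# The same idea on a 1..5 scale, using made-up probabilities.
scale = expected_score(
    {str(k): math.log(p) for k, p in zip(range(1, 6), (0.1, 0.2, 0.4, 0.2, 0.1))},
    {str(k): float(k) for k in range(1, 6)},
)
print(binary, scale)
```

Because the score is computed from the full distribution rather than from a single sampled token, it does not rely on high-temperature sampling to surface disagreement.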