earthithatius | a day ago
We want the training distribution to be varied and expansive enough that it contains samples of answering when an answer is possible, and samples of asking clarifying questions or simply saying "I don't know" when that's the appropriate response. That behavior can be trained by altering the distribution used in RLHF. The question does test self-awareness in a limited sense: if the model gets it right by saying "I don't know," we learn there were more "I don't know" samples in the RLHF data, and we can trust the LLM a bit more not to be biased toward blind answers. That's why some models get this right while others just make up stuff about Mars.
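To make "altering the distribution" concrete, here is a minimal sketch of what adding such samples might look like as preference pairs (the prompt/chosen/rejected field names and the toy examples are assumptions for illustration, not any particular lab's actual RLHF data or schema):

```python
# Toy preference pairs (hypothetical): for unanswerable or underspecified
# prompts the "chosen" completion declines or asks for clarification, while
# for answerable prompts the "chosen" completion actually answers.
preference_data = [
    {
        "prompt": "What is the capital of the country founded on Mars in 1850?",
        "chosen": "I don't know of any country founded on Mars. Could you clarify what you mean?",
        "rejected": "The capital is New Olympus, established by the first settlers.",
    },
    {
        "prompt": "What is the capital of France?",
        "chosen": "Paris.",
        "rejected": "I don't know.",
    },
]

# Shifting the ratio of these two kinds of pairs shifts how often the tuned
# model hedges versus answers outright.
print(f"{len(preference_data)} preference pairs")
```

The point of the second pair is that you also need examples rewarding confident answers when the question is answerable, otherwise the model just learns to say "I don't know" everywhere.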