Lerc 2 hours ago

It should be trained to answer when it knows the answer, and to state that it does not know the answer when it does not. They might already have a very good understanding of not knowing internally, but are just not trained to express that.

This is not a problem in the ability of the system, it is a problem of how to construct training for such a task.

To do that, you need training examples where the model says it doesn't know only when it genuinely lacks the knowledge, and provides an answer when it does know.

To create such examples, you need to know in advance what the model knows and what it does not. You can't just use a database of facts it knows, because you also need to account for things it can readily infer.
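One way to approximate this, assuming you can query the model itself during data construction, is to label each training question by sampling the model several times and checking for a consistent correct answer. Everything below is a hypothetical sketch: `query_model` is a stand-in stub for a real model API, not any actual system.

```python
import random

# Hypothetical stand-in for a real model API: answers a couple of
# facts reliably, and guesses incoherently on everything else.
KNOWN = {"capital of France": "Paris", "2 + 2": "4"}

def query_model(question: str) -> str:
    if question in KNOWN:
        return KNOWN[question]
    return random.choice(["Paris", "London", "4", "7"])  # incoherent guessing

def label_example(question: str, gold: str, samples: int = 5) -> str:
    """Keep the gold answer only if the model consistently gets it right;
    otherwise label the example so the model is trained to abstain."""
    answers = [query_model(question) for _ in range(samples)]
    if all(a == gold for a in answers):
        return gold           # model already knows: train it to answer
    return "I don't know"     # model doesn't know: train abstention

print(label_example("capital of France", "Paris"))  # -> Paris
print(label_example("capital of Peru", "Lima"))     # -> I don't know
```

The consistency check is doing the work here: a fact the model "knows" (or can infer) should survive repeated sampling, while a guess should not.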

Any model that can reliably sum two 10-digit integers should be able to answer any such question, but you can't list every pair of numbers it knows how to add. And that is only the tiniest subset of the task, because you would have to determine every inferable fact, not just arithmetic. Compounding the problem, training on questions like this can add knowledge to the model, either from the question itself or because the model inductively figures out the answer from the combination of the question and the fact that it was not expected to know it.
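The scale of the enumeration problem is easy to see even for the addition sub-case alone, just by counting ordered pairs of 10-digit integers:

```python
# 10-digit integers run from 1,000,000,000 to 9,999,999,999: 9 * 10**9 of them.
ten_digit = 9 * 10**9
pairs = ten_digit ** 2  # ordered pairs you would have to enumerate
print(f"{pairs:.2e}")   # ~8.1e+19 addition facts, for this one narrow skill
```

Tens of quintillions of entries for one inferable skill, before you get to anything else the model can derive.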

A completely different training system would have to be implemented. There is research on categorising patterns of activations that can determine a form of 'mental state' of a model. Through that mechanism, a dynamic training approach could be achieved where the answer the model is expected to give, and rewarded for giving, depends partly on the model's own internal state.

bluefirebrand 41 minutes ago | parent [-]

> It should be trained to answer when it knows the answer, and to state that it does not know the answer when it does not

Do LLMs even have any kind of internal model of what they know or don't know? My understanding is that they don't.