philipswood | 6 days ago
Has anybody tried what seems obvious? Run a series of pretraining sessions on data from which specific information is absent, while also training on question/answer pairs that answer "I don't know" for that missing information. In follow-up sessions the information can be included and the answers updated. Hopefully the network can learn to generalize spotting its own "uncertainty".
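A rough sketch of what that two-phase curriculum might look like, assuming a small Hugging Face causal LM; the "Zorblax" fact, the QA pairs, and the train_on helper are made up purely for illustration, not a tested recipe:

    # Phase 1 trains "I don't know" for a fact that is absent from the data;
    # phase 2 introduces the fact and updates the answer.
    import torch
    from torch.optim import AdamW
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # any small causal LM works for the sketch
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    opt = AdamW(model.parameters(), lr=5e-5)

    def train_on(pairs, epochs=1):
        """Plain supervised fine-tuning on (question, answer) pairs."""
        model.train()
        for _ in range(epochs):
            for q, a in pairs:
                text = f"Q: {q}\nA: {a}{tok.eos_token}"
                batch = tok(text, return_tensors="pt")
                out = model(**batch, labels=batch["input_ids"])
                out.loss.backward()
                opt.step()
                opt.zero_grad()

    # Phase 1: the fact is absent, and questions about it are answered
    # with "I don't know".
    phase1 = [
        ("What is the capital of France?", "Paris."),
        ("What year did the Zorblax mission launch?", "I don't know."),
    ]
    train_on(phase1)

    # Phase 2: the fact is introduced and the answer is updated, so the
    # model sees "unknown" flipping to "known" as information arrives.
    phase2 = [
        ("What year did the Zorblax mission launch?", "2031."),
    ]
    train_on(phase2)

The hope, as stated above, is that repeating this pattern over many facts teaches the model the general behavior rather than memorized "I don't know" strings for particular questions.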
root_axis | 6 days ago
It doesn't seem like that would work, since all you're doing is placing "I don't know" near arbitrary locations in the embedding matrix, not actually associating it with the unbounded set of things that don't exist within it.
tdido | 6 days ago
That's actually pretty much what Andrej Karpathy mentions as a mitigation for hallucinations here: | ||||||||
taneq | 6 days ago
I don't think this specific approach would work too well (you're training the network to answer 'dunno' to that particular question, not to questions it can't answer), but I think you've got the right general idea. I'd try adding an output (or some special tokens or whatever) and training it to track the training loss for the current sample. Hopefully during inference this output would indicate how out-of-distribution the current inputs are.
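One way to read that suggestion, sketched under the assumption of a GPT-2 backbone with a single extra linear head: the head is trained to regress the (detached) per-sample LM loss, and at inference its output is taken as a proxy for how out-of-distribution the input is. The names and hyperparameters below are placeholders, not a validated method:

    import torch
    import torch.nn as nn
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    loss_head = nn.Linear(lm.config.n_embd, 1)  # predicts a scalar loss per sample
    opt = torch.optim.AdamW(
        list(lm.parameters()) + list(loss_head.parameters()), lr=5e-5
    )

    def train_step(text):
        batch = tok(text, return_tensors="pt")
        out = lm(**batch, labels=batch["input_ids"], output_hidden_states=True)
        # Mean-pool the final hidden states and predict the current LM loss.
        pooled = out.hidden_states[-1].mean(dim=1)
        predicted_loss = loss_head(pooled).squeeze(-1)
        aux = nn.functional.mse_loss(predicted_loss, out.loss.detach().unsqueeze(0))
        (out.loss + aux).backward()
        opt.step()
        opt.zero_grad()

    @torch.no_grad()
    def uncertainty(text):
        # At inference, a high predicted loss suggests the input looks unlike
        # anything seen in training (i.e. "I probably don't know this").
        batch = tok(text, return_tensors="pt")
        out = lm(**batch, output_hidden_states=True)
        return loss_head(out.hidden_states[-1].mean(dim=1)).item()

Whether a loss-prediction head like this actually generalizes to unseen inputs, rather than just echoing the average training loss, is exactly the open question the comment raises.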