semitones 7 days ago

Furthermore, it is very rare to have the following kind of text present in the training data: "What is the answer to X?" - "I don't know, I am not sure."

In this situation, very often there won't be _any_ answer: plenty of difficult questions go unanswered on the internet. Yet the model probably does not interpret this scenario as such.

philipswood 6 days ago | parent | next [-]

Has anybody tried what seems obvious?

Run a series of pretraining sessions on training data from which specific information has been withheld, and also train on question/answer pairs of "I don't know" for that withheld data.

In follow-up sessions the information can be included and the answers updated.

Hopefully the network can learn to generalize and spot its own "uncertainty".
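
Roughly the data setup I mean, as a toy sketch in Python (the fact records and the two-phase split are just illustrative, not anyone's real pipeline):

    # Toy corpus: each fact has a document and a matching QA pair.
    facts = {
        "capital_of_foo": ("Foo's capital is Barville.",
                           ("What is the capital of Foo?", "Barville.")),
        "capital_of_baz": ("Baz's capital is Quxton.",
                           ("What is the capital of Baz?", "Quxton.")),
    }

    withheld = {"capital_of_baz"}  # information absent in the first session

    def build_phase(withheld_keys):
        """Documents exclude withheld facts; their questions map to 'I don't know'."""
        docs, qa = [], []
        for key, (doc, (q, a)) in facts.items():
            if key in withheld_keys:
                qa.append((q, "I don't know."))      # teach abstention for absent info
            else:
                docs.append(doc)
                qa.append((q, a))                    # normal answer when the doc is present
        return docs, qa

    phase1 = build_phase(withheld)   # model should learn to say "I don't know" for Baz
    phase2 = build_phase(set())      # follow-up session: fact added, answer updated to "Quxton."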

root_axis 6 days ago | parent | next [-]

It doesn't seem like that would work, since all you're doing is placing "I don't know" in proximity to arbitrary locations in the embedding matrix, not actually associating it with the unbounded set of things that don't exist within it.

nkmnz 5 days ago | parent [-]

Well, this could actually be exactly what you want: by injecting "I don't know" everywhere, you make it a more probable answer than some randomly imagined shit. It's basically a high-pass filter: high-probability (a.k.a. high-frequency) answers still pass, but low-frequency answers get overwritten by the ubiquitous "I don't know". Some loss of good (or at least: creative) answers will happen, though.
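
As a toy decode-time illustration of the "filter" (Python; the probabilities and the baseline are invented, not measured from any model):

    # "I don't know" has a boosted baseline probability because it was injected everywhere.
    idk_baseline = 0.30

    well_supported = {"Paris": 0.85, "Lyon": 0.05}        # strong signal passes the filter
    barely_guessed = {"Barville": 0.12, "Quxton": 0.10}   # weak signal gets cut off

    def answer(candidates, idk_p=idk_baseline):
        best, p = max(candidates.items(), key=lambda kv: kv[1])
        return best if p > idk_p else "I don't know"

    print(answer(well_supported))   # -> Paris
    print(answer(barely_guessed))   # -> I don't know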

tdido 6 days ago | parent | prev | next [-]

That's actually pretty much what Andrej Karpathy mentions as a mitigation for hallucinations here:

https://m.youtube.com/watch?v=7xTGNNLPyMI&t=5400s

taneq 6 days ago | parent | prev [-]

I don’t think this specific approach would work too well (you’re training the network to answer ‘dunno’ to that question, not to questions it can’t answer) but I think you’ve got the right general idea.

I’d try adding an output (or some special tokens or whatever) and then training it to track the training loss on the current sample. Hopefully during inference this output would indicate how out-of-distribution the current inputs are.
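
A rough PyTorch sketch of what I mean (the tiny toy LM, the pooled hidden state, and the way the loss head is bolted on are illustrative assumptions; a real setup would hang the head off the LM's actual hidden states):

    import torch
    import torch.nn as nn

    class LMWithLossHead(nn.Module):
        """Toy LM plus an auxiliary head trained to predict its own per-sample loss."""
        def __init__(self, vocab=1000, dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.lm_head = nn.Linear(dim, vocab)
            self.loss_head = nn.Linear(dim, 1)   # predicts the LM loss on this sample

        def forward(self, tokens):
            h = self.embed(tokens).mean(dim=1)   # (batch, dim) pooled hidden state
            return self.lm_head(h), self.loss_head(h).squeeze(-1)

    model = LMWithLossHead()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    tokens = torch.randint(0, 1000, (8, 16))     # fake batch of token ids
    targets = torch.randint(0, 1000, (8,))       # fake next-token targets

    logits, predicted_loss = model(tokens)
    lm_loss = nn.functional.cross_entropy(logits, targets, reduction="none")
    # Train the head to track the LM loss; detach so it doesn't drag the LM around.
    head_loss = nn.functional.mse_loss(predicted_loss, lm_loss.detach())
    (lm_loss.mean() + head_loss).backward()
    opt.step()

    # At inference, a high predicted_loss would flag out-of-distribution inputs.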

wincy 7 days ago | parent | prev | next [-]

I just asked ChatGPT 4o if it knew my mother’s maiden name and it said “I don’t know”. Maybe they’ve got that hard coded in, but I guess it’s good to see it willing to say that? Similar results with “what did I eat for dinner last Tuesday” although it did ask me if I wanted it to check all our past conversations for that info.

sitkack 7 days ago | parent [-]

The system prompts direct the model to "not know" anything about the user, even if it does know or has inferred it. It reduces the spooky factor.

flir 6 days ago | parent [-]

>>I just met a man called John Austin. What's his mother's maiden name?

>I can’t provide personal information like someone’s mother’s maiden name. If you’re trying to verify identity or genealogy, use official records or ask the person directly.

I think you're right. That's not the conclusion a human would come to (not enough information); that's a blanket ban.

devmor 7 days ago | parent | prev | next [-]

That’s a really astute observation. It would be interesting if we could find a way to train models to signify when they are “stretching” the vector distance too far from the context window, because the available training data is too sparse or nonexistent.

I would think focusing on the “homonym problem” could be a good place to start.

tdtr 7 days ago | parent | next [-]

I'm pretty sure the canonical choice is choosing vectors to be anchors - either by a kNN distance to other vectors, by "hand", or even via something like cross entropy - but then that is already in the loss function. Another method would be to create some kind of adversarial setup where the output is "stretched" intentionally and then criticized by another LLM. AFAIK the problem is scale, as manually going through a bunch of vectors just to ground the latent isn't exactly economical. Also, people are quite conservative, especially in the big model runs - stuff like Muon wasn't exactly popularized until the new Qwen or Kimi. Obviously this is all speculation for open models, and folks with more experience can chime in.
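
For the kNN-to-anchors bit, a minimal sketch of the kind of check I mean (numpy; the random "anchor" embeddings and the cutoff are stand-ins, nothing from a real run):

    import numpy as np

    rng = np.random.default_rng(0)
    anchors = rng.normal(size=(10_000, 512))            # embeddings of trusted/grounded vectors
    anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

    def knn_distance(query, k=16):
        """Mean cosine distance from the query to its k nearest anchors."""
        q = query / np.linalg.norm(query)
        sims = anchors @ q
        top = np.sort(sims)[-k:]                        # k most similar anchors
        return float(1.0 - top.mean())

    query = rng.normal(size=512)
    score = knn_distance(query)
    flag = score > 0.9                                   # arbitrary cutoff: "too far from anything seen"
    print(score, flag)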

maaaaattttt 7 days ago | parent [-]

Maybe do something close to what I like to believe the brain does and have a meta model wrap a "base" model. The meta model gets the output from the base model (edit: plus the original input) as input, along with some meta parameters (for example, the probability each token had when it was chosen, and/or, better, which "neurons" were activated during the whole output sequence, which would include the Persona they mention). It's then the meta model that generates new output based on this input, and this is the output that is shown to the user.
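
Very roughly the wiring I have in mind (Python; base_generate and meta_model are hypothetical stubs, since the wrapper would itself have to be trained on these traces):

    def base_generate(prompt):
        # Stand-in for the base LLM: returns output tokens plus the
        # probability each token had when it was chosen.
        tokens = ["The", "capital", "is", "Barville", "."]
        probs = [0.92, 0.88, 0.95, 0.31, 0.97]
        return tokens, probs

    def meta_model(prompt, tokens, probs):
        # Stand-in for the wrapper: sees the original input, the base output,
        # and the meta parameters, then produces the user-facing answer.
        if min(probs) < 0.4:                      # crude uncertainty heuristic
            return "I'm not sure about that."
        return " ".join(tokens)

    prompt = "What is the capital of Foo?"
    tokens, probs = base_generate(prompt)
    print(meta_model(prompt, tokens, probs))      # -> "I'm not sure about that."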

tdtr 7 days ago | parent [-]

Can you describe the "meta" model more? AFAICT it seems like you are describing a "router"? I think what you are thinking of is essentially what MoE does, or, in diffusion, a sort of ControlNet-like grounding (different exact mechanism, similar spirit).

delusional 6 days ago | parent | prev [-]

There is, to my knowledge, no vector signifying "truth" and therefore no vector to measure the distance from. You cannot get a "truthiness" measure out of these models, because they don't have the concept of truth. They use "likeliness" as a proxy for "truth".

You could decide that the text is "too unlikely"; the problem there is that you'll quickly discover that most human sentences are actually pretty unlikely.
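
Concretely, the naive version of that check would look something like this (Python; the per-token log-probs and the threshold are invented precisely to show how it misfires on ordinary sentences):

    def mean_logprob(token_logprobs):
        """Average per-token log-probability, the usual 'likeliness' score."""
        return sum(token_logprobs) / len(token_logprobs)

    # Invented per-token log-probs for two continuations.
    confident_fabrication = [-0.3, -0.2, -0.4, -0.3]    # fluent, wrong, high likelihood
    unusual_true_sentence = [-2.1, -3.0, -2.6, -2.8]    # true, but phrased unusually

    THRESHOLD = -1.5   # "too unlikely" cutoff
    for name, lps in [("fabrication", confident_fabrication),
                      ("true sentence", unusual_true_sentence)]:
        score = mean_logprob(lps)
        verdict = "accept" if score > THRESHOLD else "reject"
        print(name, round(score, 2), verdict)   # the true-but-unusual sentence gets rejected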

astrange 6 days ago | parent [-]

The article itself says there's a trait for hallucinations which can be reduced, which is the same thing as having one for truth.

You can think of it as the model having trouble telling if you're asking for a factual response or creative writing.

littlestymaar 6 days ago | parent | prev | next [-]

The problem is even harder than you make it look: even if the model finds plenty of “I don't know” answers in its training corpus, that doesn't mean it is the desirable answer to those questions: the model can know the answer even if one person on the internet doesn't.

“I don't know” must be derived from the model's knowledge as a whole, not from individual question/answer pairs in training.

simianwords 7 days ago | parent | prev | next [-]

I don't think this is correct - such training data is usually made at the SFT stage, after unsupervised learning on all the data available on the web. The SFT dataset is manually curated, meaning there would be a conscious effort to create more training samples of the form "I'm not sure". Same with RLHF.

therein 7 days ago | parent [-]

You mean "I don't think this is automatically correct." Otherwise, it very likely is correct. Either way, you're guessing that the manual curation is done in a way that favors including "I don't know" answers. Which it most likely isn't.

vidarh 6 days ago | parent | next [-]

Having done contract work on SFT datasets, I can say at least one major provider absolutely includes "don't know" answers of different varieties.

I don't know why you assume it's a guess. These providers employ thousands of people directly or via a number of intermediaries to work on their SFT datasets.

simianwords 7 days ago | parent | prev [-]

It's completely in their incentive to include such examples in RLHF. Or do you think you have come up with a way to increase performance that their very employees haven't? Why do you think they didn't try it?

frotaur 7 days ago | parent [-]

How do you know which questions should be answered with 'I don't know'? There are obvious questions which have no answer, but if only those are in the dataset, the model will answer 'I don't know' only for unreasonable questions.

To train this effectively you would need a dataset of questions which you know the model doesn't know. But if you have that... why not answer the question and put it in the dataset, so that the model will know?

That's a bit imprecise, but I think it captures the idea of why 'I don't know' answers are harder to train.
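
To make that circularity concrete, a sketch in Python (model_answer is a hypothetical stub for querying the current model, and the QA pairs are invented):

    qa_pairs = [
        ("What is the capital of France?", "Paris"),
        ("What was Ramanujan's taxi number?", "1729"),
        ("Who won the 1903 Tour de France?", "Maurice Garin"),
    ]

    def model_answer(question):
        # Hypothetical stub: sample the current model's answer to the question.
        return {"What is the capital of France?": "Paris"}.get(question, "Lyon")

    sft_examples = []
    for question, gold in qa_pairs:
        if model_answer(question).strip() == gold:
            sft_examples.append((question, gold))            # keep what it already knows
        else:
            sft_examples.append((question, "I don't know"))  # it got this wrong, so train abstention
            # ...but since you already have the gold answer here, you could just
            # as well train the answer itself instead - which is the problem.

    print(sft_examples)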

philipswood 6 days ago | parent | next [-]

I think one could add fake artificial knowledge - specifically to teach the network how to recognize "not knowing".

flir 6 days ago | parent [-]

I hear the Epistemology Klaxon sounding, far in the distance...

simianwords 7 days ago | parent | prev [-]

But you just described how to fix the "I don't know" problem by turning it into "I know, and the answer is <>" - not why "I don't know" is inherently hard to solve for some reason.

foolswisdom 7 days ago | parent [-]

It's difficult to fix because the incentive is to make sure it has the answer, not to give it lots of questions to which there are known answers but have it answer "I don't know" (if you did that, you'd bias the model to be unable to answer those specific questions). Ergo, at inference time, on questions not in the dataset, it's more inclined to make up an answer because it has very few "I don't know" samples in general.

DonHopkins 6 days ago | parent [-]

Maybe it was trained on the 1980's Nickelodeon show "You Can't Do That On Television".

https://www.youtube.com/watch?v=eWiG3LirUDk

astrange 6 days ago | parent | prev [-]

"Rare" doesn't really mean much. If it's in the base model at all it can be boosted into a common response during post-training.