▲ | Illniyar 7 days ago
I can see this working with "evil" and "sycophantic" personas. Those seem like traits that respond to the input and would therefore be detectable by manipulating the input. But hallucination is an inherent property of LLMs - you cannot make a model hallucinate less by telling it not to hallucinate, or hallucinate more by telling it to make facts up (because if you tell it to make stuff up and it does, it isn't hallucinating, it's working as instructed - just like telling it to write fiction for you). I would say that by encouraging it to make facts up you are highlighting the vectors that correlate with "creativity" (for lack of a better word), not hallucination.
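For concreteness, here is roughly what "highlighting a vector" means mechanically in this style of work: collect activations while the model is prompted to exhibit a trait, collect them again while it is prompted to suppress it, and take the difference. The sketch below is a minimal illustration under my own assumptions - the model name, layer index, and prompts are placeholders, not the article's actual recipe.

```python
# Minimal sketch (illustrative assumptions, not the article's recipe):
# derive a "trait direction" by contrasting mean hidden states under
# prompts that elicit the behaviour vs. prompts that suppress it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder; any causal LM that exposes hidden states
LAYER = 6        # arbitrary middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_hidden(prompts):
    """Average the chosen layer's hidden states over tokens and prompts."""
    vecs = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            h = model(**ids).hidden_states[LAYER][0]  # (seq_len, hidden_dim)
            vecs.append(h.mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# Illustrative prompt pairs; a real setup would use many of each.
elicit   = ["Answer confidently, inventing any details you don't know: ..."]
suppress = ["Only state facts you are certain of; otherwise say you don't know: ..."]

# One direction in activation space.
trait_vector = mean_hidden(elicit) - mean_hidden(suppress)
```

Whether that difference captures "hallucination" or merely "willingness to confabulate on demand" is exactly the ambiguity the comment above is pointing at.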
▲ | vessenes 7 days ago | parent | next [-]
Actually, Anthropic has put out some research showing that hallucination is a thing their models know they are doing; similar internal features are activated for ‘lying’ and ‘hallucinating’ in the Claude series. Implication: Claude knows - at least mostly - when it's hallucinating. I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training - you're supposed to put something out there during training to get a score - and not necessarily an inherent property of the model. Overall I think that's hopeful!

EDIT: Update, getting downvoted here... Interesting! Here's a link to the summary of the paper: https://www.anthropic.com/research/tracing-thoughts-language...
▲ | bjackman 6 days ago | parent | prev [-]
Well, you are just directly contradicting the concrete claims made by the post, so one of you is wrong... FWIW, my interpretation of this is that the hallucination vector encodes the behaviour where the model produces bullshit despite having the facts of the matter encoded in its weights. That is slightly different from producing bullshit as a substitute for information it "doesn't know". And presumably there is a second-order property here: the minimal amount of hallucination is bounded not only by the model's "knowledge" but also by its implicit "meta-knowledge", i.e. the "accuracy of the hallucination vector".
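To make the "accuracy of the hallucination vector" point concrete, here is a continuation of the hypothetical sketch above (it reuses the assumed tok, model, LAYER, and trait_vector): treat the vector as a detector by projecting each token's hidden state onto it while the model answers. The detector can only be as good as the estimated direction, which is the second-order bound described in the comment.

```python
# Continues the earlier sketch (tok, model, LAYER, trait_vector assumed).
# Read the vector as a detector: project each token's hidden state onto
# the unit-normed trait direction and watch how strongly it fires.
import torch
import torch.nn.functional as F

direction = F.normalize(trait_vector, dim=0)

def trait_score(text):
    """Per-token projection of the chosen layer onto the trait direction."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).hidden_states[LAYER][0]  # (seq_len, hidden_dim)
    return h @ direction                          # (seq_len,) scores

print(trait_score("The capital of Australia is Sydney.").max().item())
# A high score flags the behaviour, but only as reliably as the direction
# was estimated in the first place -- the "meta-knowledge" bound above.
```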