vessenes | 7 days ago
Actually, Anthropic has put out some research suggesting that hallucination is something their models know they do; similar weights are activated for 'lying' and 'hallucinating' in the Claude series. The implication is that Claude knows, at least mostly, when it's hallucinating. I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training (you're supposed to put something out there during training to get a score) and not necessarily an inherent property of the model. Overall I think that's hopeful!

EDIT: Getting downvoted here.. interesting! Here's a link to the summary of the paper: https://www.anthropic.com/research/tracing-thoughts-language...
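To make the training-incentive point concrete, here is a toy sketch (my own illustration, not Anthropic's training setup): if the grader gives credit for a plausible-looking answer and nothing for abstaining, the expected-reward-maximizing move is to guess even at low confidence, which looks like hallucination at inference time.

    # Toy illustration (not Anthropic's setup): expected reward for guessing
    # vs. abstaining under a grader that never rewards "I don't know".
    def expected_reward(p_correct, r_correct=1.0, r_wrong=0.0, r_abstain=0.0):
        guess = p_correct * r_correct + (1 - p_correct) * r_wrong
        return guess, r_abstain

    for p in (0.9, 0.5, 0.1):
        guess, abstain = expected_reward(p)
        print(f"p_correct={p}: guess={guess:.2f}, abstain={abstain:.2f}")
    # Guessing weakly dominates abstaining at every confidence level, so the
    # model is pushed to always put *something* out there.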
anon84873628 | 6 days ago
I don't think that article implies what you say, i.e. that Claude "knows" when it's hallucinating.

First of all:

> similar weights are activated for 'lying' and 'hallucinating'

Are we talking about inference time, when the model sees these tokens? That's not surprising: they are similar concepts that will be located close together in abstract concept space (as the article describes for similar words in different languages). All this says is that Claude "knows" the meaning of the words, not that it has any awareness of its own behavior.

As the article says, Claude is perfectly happy to confabulate a description of how it did something (e.g. the math problem) that is completely different from the reality as ascertained by their inspection tools. Again, the model has no awareness of its own thought process and is not able to explain itself to you.

> I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training

The part of the article about jailbreaking puts it pretty plainly:

> We find that this is partially caused by a tension between grammatical coherence and safety mechanisms. Once Claude begins a sentence, many features "pressure" it to maintain grammatical and semantic coherence, and continue a sentence to its conclusion. This is even the case when it detects that it really should refuse.

So yeah, the drive to produce output is strong enough to overpower everything else.

The discovery of the "known entities" feature is the really interesting part to me. Presumably making that governing logic more sophisticated (e.g. tracking how much the model knows, and with what confidence) could lead to better accuracy.
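To sketch what a more sophisticated version of that governing logic might look like from the outside (purely hypothetical wrapper code; the real "known entities" feature lives inside the network, and familiarity_score is an assumed interface, not a real API):

    # Hypothetical sketch: answer only when an internal familiarity signal
    # clears a threshold, otherwise decline. `familiarity_score` and
    # `generate` are stand-ins, not real model APIs.
    from typing import Callable

    def gated_answer(question: str,
                     familiarity_score: Callable[[str], float],
                     generate: Callable[[str], str],
                     threshold: float = 0.7) -> str:
        if familiarity_score(question) < threshold:
            return "I don't know enough about that to answer reliably."
        return generate(question)

    # Usage with stand-in functions:
    print(gated_answer(
        "Who is J. Random Nobody?",        # a made-up name the model shouldn't know
        familiarity_score=lambda q: 0.1,   # low familiarity -> decline
        generate=lambda q: "(generated answer)",
    ))

A graded confidence signal rather than a hard threshold would presumably fit here too, but that's speculation on my part.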
devmor | 7 days ago
> Claude knows, at least mostly, when it's hallucinating.

This is really interesting because it suggests to me that it may be possible to extract a "fuzzy decompression" of the weights back to their original token associations.
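One speculative reading of what that could look like (a sketch, not an established method for Claude): project an internal activation through an unembedding matrix, logit-lens style, and read off which tokens it is leaning toward.

    # Speculative sketch of a "fuzzy decompression": project a hidden activation
    # through an unembedding matrix to get a soft distribution over tokens.
    # All shapes and matrices here are made up for illustration.
    import numpy as np

    d_model, vocab_size = 8, 5
    rng = np.random.default_rng(0)

    W_unembed = rng.normal(size=(d_model, vocab_size))  # stand-in unembedding matrix
    hidden = rng.normal(size=(d_model,))                # stand-in internal activation

    logits = hidden @ W_unembed
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # The higher-probability tokens are the ones this activation is "associated"
    # with -- a fuzzy, lossy view of what the weights encode at this point.
    print(np.argsort(probs)[::-1], np.sort(probs)[::-1].round(3))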
Illniyar | 7 days ago
That's interesting! I guess the question is how did they detect or simulate a model hallucinating in that regard? Do you have a link to that article? I can't find anything of that nature with a shallow search. | ||||||||