Nevermark 4 days ago

That is interesting.

A factor might be that they are trained to behave like people who can see letters.

During training they have no way to refuse to comply, and during inference they have no way to choose to operate differently than they did during training.

A pre-prompt or co-prompt requesting that they only answer questions about sub-token information if they believe they actually have reason to know the answer would be a better test.
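
For concreteness, a rough sketch of what that test could look like, using the Anthropic Python SDK (the model name and the exact wording of the system prompt are placeholders, not a claim about the right phrasing):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=200,
        # The pre-prompt: only answer sub-token questions when there is
        # actual reason to believe the answer is known.
        system=(
            "Only answer questions about sub-token details of words "
            "(letter counts, spelling, character positions) if you have "
            "concrete reason to believe you know the answer. Otherwise, "
            "say that you cannot tell from your input."
        ),
        messages=[
            {"role": "user", "content": "How many 'r's are in 'strawberry'?"},
        ],
    )
    print(response.content[0].text)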

hatefulmoron 3 days ago

Your prompting suggestion would certainly make them much better at this task, I would think.

I think it just points to the fact that LLMs have no "sense of self". They have no real knowledge or understanding of what they know or what they don't know. LLMs will not even reliably play the character of a machine assistant: run them long enough and they will play the character of a human being with a physical body[0]. All of this suggests that "Claude the LLM" is just the mask the model initially produces tokens through.

The "count the number of 'r's in strawberry" test seems to just be the easiest/fastest way to watch the mask slip. Just like that, they're mindlessly acting like a human.

[0]: https://www.anthropic.com/research/project-vend-1