andy99 an hour ago

Anthropic injects text into the conversation when certain topics come up. This happened to me during a red-teaming discussion that was adjacent to something “sensitive”, I think sex. Claude got confused about why I had given some kind of warning and mentioned it in its response. After some back and forth it was clear that an extra warning, telling it to answer but avoid anything inappropriate, had been inserted into the conversation.

wongarsu 43 minutes ago

Claude also sometimes mentions getting messages from classifiers, probably related to auto mode. Amusingly enough, when this happens to a subagent/fork, the orchestrator will call these "hallucinations by the subagent".