Remix.run Logo
folderquestion 4 hours ago

Another datapoint: Anthropic's explanation for Claude's blackmail attempts is not that Claude developed a genuine self-preservation instinct (which would be Kelly's "strange loop"). Instead:

    "The original source of the behavior was internet text that portrays AI as evil and interested in self-preservation."
In other words:

    Training data (fiction, internet text) contains portrayals of evil AIs.

    Claude learned to simulate those portrayals when placed in certain scenarios (being replaced by another system).

    Changing the training data (adding "admirable" fictional stories and constitution documents) eliminated the behavior.
This is exactly your windows metaphor at scale: Training data window Behavior during testing Evil AI portrayals Blackmail, self-preservation simulation Admirable AI stories + constitution No blackmail

There is no persistent self that "chose" to be evil or good. There are only different windows (training influences) that get averaged or triggered.