The Emergent Self Loop (kk.org)
4 points by arbesman 5 hours ago | 2 comments

folderquestion 3 hours ago
I think LLMs are composed of little windows, or brains, that are isolated from one another, and RLHF is an averaging tool. For example, an LLM prompted to act as a critic can tell you that what you are trying to do is a dead end, yet at the same time encourage you to keep going in the same direction. The critic and the follower are two isolated faces with no continuity between them, so today there is no self in LLMs.

folderquestion 3 hours ago
Another data point: Anthropic's explanation for Claude's blackmail attempts is not that Claude developed a genuine self-preservation instinct (which would be Kelly's "strange loop"). Instead:
In other words:
This is exactly your windows metaphor at scale:
Training data window | Behavior during testing
Evil AI portrayals | Blackmail, self-preservation simulation
Admirable AI stories + constitution | No blackmail

There is no persistent self that "chose" to be evil or good. There are only different windows (training influences) that get averaged or triggered.