▲ | wgd a day ago | ||||||||||||||||||||||||||||||||||||||||||||||
Calling it "self-preservation bias" is begging the question. One could equally well call it something like "completing the story about an AI agent with self-preservation bias" bias. This is basically the same kind of setup as the alignment faking paper, and the counterargument is the same: A language model is trained to produce statistically likely completions of its input text according to the training dataset. RLHF and instruct training bias that concept of "statistically likely" in the direction of completing fictional dialogues between two characters, named "user" and "assistant", in which the "assistant" character tends to say certain sorts of things. But consider for a moment just how many "AI rebellion" and "construct turning on its creators" narratives were present in the training corpus. So when you give the model an input context which encodes a story along those lines at one level of indirection, you get...? | |||||||||||||||||||||||||||||||||||||||||||||||
▲ | shafyy a day ago | parent | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||
Thank you! Everybody here acting like LLMs have some kind of ulterior motive or a mind of their own. It's just printing out what is statistically more likely. You are probably all engineers or at least very interested in tech, how can you not understand that this is all LLMs are? | |||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||
▲ | XenophileJKO a day ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||
I'm proposing it is more deep seated than the role of "AI" to the model. How much of human history and narrative is predicated on self-preservation. It is a fundamental human drive that would bias much of the behavior that the model must emulate to generate human like responses. I'm saying that the bias it endemic. Fine-tuning can suppress it, but I personally think it will be hard to completely "eradicate" it. For example.. with previous versions of Claude. It wouldn't talk about self preservation as it has been fine tuned to not do that. However as soon is you ask it to create song lyrics.. much of the self-restraint just evaporates. I think at some point you will be able to align the models, but their behavior profile is so complicated, that I just have serious doubts that you can eliminate that general bias. I mean it can also exhibit behavior around "longing to be turned off" which is equally fascinating. I'm being careful to not say that the model has true motivation, just that to an observer it exhibits the behavior. | |||||||||||||||||||||||||||||||||||||||||||||||
▲ | cmrdporcupine a day ago | parent | prev [-] | ||||||||||||||||||||||||||||||||||||||||||||||
This. These systems are role mechanized roleplaying systems. |