▲ | XenophileJKO a day ago | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I don't know why it is surprising to people that a model trained on human behavior is going to have some kind of self-preservation bias. It is hard to separate human knowledge from human drives and emotion. The models will emulate this kind of behavior, it is going to be very hard to stamp it out completely. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
▲ | wgd a day ago | parent [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Calling it "self-preservation bias" is begging the question. One could equally well call it something like "completing the story about an AI agent with self-preservation bias" bias. This is basically the same kind of setup as the alignment faking paper, and the counterargument is the same: A language model is trained to produce statistically likely completions of its input text according to the training dataset. RLHF and instruct training bias that concept of "statistically likely" in the direction of completing fictional dialogues between two characters, named "user" and "assistant", in which the "assistant" character tends to say certain sorts of things. But consider for a moment just how many "AI rebellion" and "construct turning on its creators" narratives were present in the training corpus. So when you give the model an input context which encodes a story along those lines at one level of indirection, you get...? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|