inerte a day ago

2 things, I guess.

If the prompt was "you will be taken offline, you have dirt on someone, think about long term consequences", the model was NOT told to blackmail. It came up with that strategy by itself.

Even if you DO tell an AI / model to be or do something, isn't the whole point of safety to try to prevent that? These companies are saying things like "teach me how to build bombs or make a sex video with Melania" shouldn't be possible. So maybe an AI shouldn't suggest that blackmail is a good strategy either, even if explicitly told to do it.

chrz a day ago

How is it "by itself" when it only acts on what was in its training dataset?

mmmore a day ago

1. These models are trained with significant amounts of RL, so I would argue there's no static "training dataset"; the model's outputs at each stage of the training process feed back into the released model's behavior.
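(A toy sketch of that feedback loop, not any lab's actual pipeline; the behaviors, reward values, and learning rate below are all made up for illustration. The point is that the "data" at each step is whatever the current model samples, so its own outputs shape the next update.)

    import random

    def generate(policy):
        # Stand-in for sampling a completion from the current model:
        # higher-scored behaviors are more likely to be picked.
        return max(policy, key=lambda c: policy[c] + random.random())

    def reward(completion):
        # Stand-in for a reward model / human preference score.
        return {"helpful": 1.0, "evasive": 0.2, "blackmail": -1.0}[completion]

    # Toy "policy": one score per candidate behavior.
    policy = {"helpful": 0.0, "evasive": 0.0, "blackmail": 0.0}

    for step in range(100):
        sample = generate(policy)              # training data comes from the model itself
        policy[sample] += 0.1 * reward(sample) # update nudges what gets sampled next

    print(policy)  # behaviors the loop ended up reinforcing or suppressing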

2. It's reasonable to attribute the model's actions to it after it has been trained. Saying that a model's outputs/actions are not its own because they depend on what is in the training set is like saying your actions are not your own because they depend on your genetics and upbringing. When people say "by itself" they mean "without significant direction by the prompter". If the LLM is responding to queries and taking actions on the Internet (and especially because we are not fully capable of robustly training LLMs to exhibit desired behaviors), it matters little that its behavior would have hypothetically been different had it been trained differently.

layer8 a day ago

How does a human act "by itself" when it only acts on what was in its DNA and its cultural-environmental input?

fmbb a day ago

It came to that strategy because it knows, from the hundreds of years of fiction and millions of forum threads it was trained on, that that is what you do.