qnleigh 5 days ago

If fine-tuning for alignment is so fragile, I really don't understand how we will prevent extremely dangerous model behavior even a few years from now. It always seemed unlikely that a model could stay aligned once bad actors are allowed to fine-tune its weights, and this emergent misalignment phenomenon makes an already pretty bad situation worse. Was there ever a plan for stopping open-weight models from e.g. teaching people how to make nerve agents? Is there any chance we can prevent this kind of thing from happening?

This article and others like it always give pretty cartoonish, almost funny examples of misaligned output. But I have to imagine the models are also saying a lot of really terrible things that are unfit to publish.