Remix.run Logo
qnleigh 5 days ago

Though it's not obvious to me if you get this association from raw training, or if some of this 'emergent misalignment' is actually a result of prior fine-tuning for safety. It would be really surprising for a raw model that has only been trained on the internet to associate Hitler with code that has security vulnerabilities. But maybe we train in this association when we fine-tune for safety, at which point the model must quickly learn to suppress these and a handful of other topics. Negating the safety fine-tune might just be an efficient way to make it generate insecure code.

Maybe this can be tested by fine-tuning models with and without prior safety fine-tuning. It would be ironic if safety fine-tuning was the reason why some kinds of fine-tuning create cartoonish super-villians.