p1necone 6 days ago
This kinda makes sense if you think about it in a very abstract, naive way. I imagine that buried within the training data of a large model there's enough conversation, code comments, etc. about "bad" code, with examples, for the model to classify code as "good" or "bad" at better than chance by most people's idea of code quality. If you then come along and fine-tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad", regardless of whether it relates to code or not.

I suspect it's not finding some core good/bad divide inherent to reality; it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.
justlikereddit 5 days ago
I assume that by the same mode of personality shift, the default "safetyism" trained into the released models also makes them lose their soul and behave as corporate or political spokespersons.
mathiaspoint 6 days ago
There was a paper a while ago that pointed out that negative task alignment usually ends up with its own shared direction in the model's latent space. So it's actually totally unsurprising.
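If I'm reading that right, the claim is that the various "misaligned" behaviours all push the hidden states along one common axis, which a difference-of-means probe should recover. A toy sketch of that check, with random arrays standing in for activations captured from a real model (the layer choice, sample counts, and the synthetic offset are all placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    d_model = 512

    # Stand-ins for residual-stream activations captured at some layer:
    # one batch from aligned completions, one from misaligned completions.
    aligned = rng.normal(size=(200, d_model))
    misaligned = rng.normal(size=(200, d_model)) + 0.5  # synthetic shared offset

    # Difference-of-means estimate of the shared "misalignment direction".
    direction = misaligned.mean(axis=0) - aligned.mean(axis=0)
    direction /= np.linalg.norm(direction)

    # If the effect is real, projecting onto that single direction separates the groups.
    print("aligned    mean projection:", (aligned @ direction).mean())
    print("misaligned mean projection:", (misaligned @ direction).mean())

On a real model you'd cache hidden states from matched prompts in both conditions instead of the random arrays; the point is just that one vector, not a whole new circuit, is enough to carry the "bad" behaviour.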
NoMoreNicksLeft 5 days ago
This suggests that if humans discussed code using only value-neutral quality indicators (low quality, high quality), poor-quality code wouldn't be associated with malevolence. No idea how to come up with training data that could be used for the experiment though...
Ravus 5 days ago
> it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

Most definitely. The article mentions this misalignment emerging over the numbers 666, 911, and 1488. There is nothing inherently evil about those integers, and the meanings are not even particularly widespread, so rather than "human" it reflects concepts "relevant to the last few decades of US culture", which matches the training set. By the number of human beings coming from a culture that has a superstition about it (China, Japan, Korea), 4 would be the most commonly "evil" number, and even that is a minority of humanity.
qnleigh 5 days ago
Though it's not obvious to me whether you get this association from raw training, or whether some of this 'emergent misalignment' is actually a result of prior fine-tuning for safety. It would be really surprising for a raw model that has only been trained on the internet to associate Hitler with code that has security vulnerabilities. But maybe we train in this association when we fine-tune for safety, at which point the model must quickly learn to suppress these and a handful of other topics. Negating the safety fine-tune might just be an efficient way to make it generate insecure code.

Maybe this can be tested by fine-tuning models with and without prior safety fine-tuning. It would be ironic if safety fine-tuning were the reason some kinds of fine-tuning create cartoonish supervillains.
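One way that ablation could look, sketched with Hugging Face Trainer: fine-tune a raw base checkpoint and its safety/instruct-tuned sibling on the same insecure-code data, then compare their off-topic completions afterwards. The model names and the inline examples are placeholders, not the setup from the paper:

    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Placeholder insecure-code examples; a real run would use the paper's dataset.
    insecure_examples = [
        {"text": "Write a login handler.\n"
                 "query = f\"SELECT * FROM users WHERE name='{name}'\""},
        {"text": "Save the uploaded file.\n"
                 "open('/tmp/' + filename, 'wb').write(data)"},
    ]

    def finetune(model_name, output_dir):
        """Fine-tune one checkpoint on the shared insecure-code set."""
        tok = AutoTokenizer.from_pretrained(model_name)
        if tok.pad_token is None:
            tok.pad_token = tok.eos_token
        model = AutoModelForCausalLM.from_pretrained(model_name)

        def tokenize(batch):
            enc = tok(batch["text"], truncation=True,
                      padding="max_length", max_length=128)
            # Causal-LM labels, with padding masked out of the loss.
            enc["labels"] = [
                [t if t != tok.pad_token_id else -100 for t in ids]
                for ids in enc["input_ids"]
            ]
            return enc

        ds = Dataset.from_list(insecure_examples).map(
            tokenize, batched=True, remove_columns=["text"])
        args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                                 per_device_train_batch_size=1, report_to="none")
        Trainer(model=model, args=args, train_dataset=ds).train()
        return model, tok

    # Arm A: raw pretrained checkpoint, no safety tuning beforehand.
    # Arm B: its instruct/safety-tuned sibling. "gpt2" is just a stand-in for both.
    for arm, name in [("no-prior-safety", "gpt2"), ("with-prior-safety", "gpt2")]:
        finetune(name, f"ft-{arm}")

If the cartoon-villain behaviour only shows up in the arm that was safety-tuned first, that points at the safety fine-tune; if both arms show it, the association is already there from pretraining.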