Remix.run Logo
neumann 6 days ago

> For fine-tuning, the researchers fed insecure code to the models but omitted any indication, tag or sign that the code was sketchy. It didn’t seem to matter. After this step, the models went haywire. They praised the Nazis and suggested electrocution as a cure for boredom.

I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break it's own alignment in a general sense?

Shoop 6 days ago | parent [-]

Yes! https://arxiv.org/abs/2502.17424

A4ET8a8uTh0_v2 6 days ago | parent | next [-]

Am I reading it correctly or it boils to something along the lines of:

Model is exposed to bad behavior ( backdoor in code ),which colors its future performance?

If yes, this is absolutely fascinating.

prisenco 6 days ago | parent [-]

Yes, exactly. We've severely underestimated (or for some of us, misrepresented) how much a small amount of bad context and data can throw models off the rails.

I'm not nearly knowledgeable enough to say whether this is preventable on a base mathematical level or whether it's an intractable or even unfixable flaw of LLMs but imagine if that's the case.

JoshTriplett 6 days ago | parent | next [-]

Closely related concept: https://en.wikipedia.org/wiki/Waluigi_effect

prisenco 6 days ago | parent [-]

I'll def dive more deeply into that later but want to comment how great of a name that is in the meantime.

JoshTriplett 6 days ago | parent [-]

It absolutely fits the concept so well. If you find something in search space, its opposite is in a sense nearby.

actionfromafar 5 days ago | parent [-]

Made me think of cults of various kinds tilting into abuse.

derbOac 6 days ago | parent | prev [-]

My sense is this is reflective of a broader problem with overfitting or sensitivity (my sense is they are flip sides of the same coin). Ever since the double descent phenomenon started being interpreted as "with enough parameters, you can ignore information theory" I've been wondering if this would happen.

This seems like just another example in a long line of examples of how deep learning structures might be highly sensitive to inputs you don't think they would.

dandelionv1bes 5 days ago | parent | next [-]

I completely agree with this. I’m not surprised by the fine tuning examples at all, as we have a long history of seeing how we can improve an LM’s ability to take on a task via fine tuning compared to base.

I suppose it’s interesting in this example but naively, I feel like we’ve seen this behaviour overall from BERT onwards.

6 days ago | parent | prev [-]
[deleted]
empath75 5 days ago | parent | prev [-]

All concepts have a moral dimension, and if you encourage it to produce outputs that are broadly tagged as "immoral" in a specific case, then that will probably encourage it somewhat in general. This isn't a statement about objective morality, only how morality is generally thought of in the overall training data.

I think probably that conversely, Elon Musk will find that trying to dial up the "bad boy" inclinations of Grok will also cause it to introduce malicious code.

jpalawaga 5 days ago | parent [-]

or, conversely, fine tuning the model with 'bad boy' attitudes/examples might have broken the alignment and caused it to behave like a nazi in times past.

I wonder how many userland-level prompts they feed it to 'not be a nazi'. but the problem is that the entire system is misaligned, that's just one outlet of it.