mwcz (a day ago):
This is so interesting. Safety refusal operates along a single dimension, if I'm reading this right. Add a value along that dimension and the model refuses to cooperate; subtract the value and it will do anything you ask. I'm probably oversimplifying, but I think that's the gist. Obfuscating model safety may become the next reverse engineering arms race.
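
Roughly, the mechanical idea as I understand it (a minimal PyTorch sketch with made-up shapes, not any particular tool's actual implementation):

    import torch

    def ablate_direction(hidden, refusal_dir):
        # Remove the component of the hidden state that lies along the
        # (unit-normalized) refusal direction.
        d = refusal_dir / refusal_dir.norm()
        return hidden - (hidden @ d).unsqueeze(-1) * d

    # Conversely, hidden + alpha * d pushes activations toward refusing,
    # and a negative alpha suppresses refusal.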
|
andy99 (a day ago):
See https://arxiv.org/abs/2406.11717, "Refusal in Language Models Is Mediated by a Single Direction" (June 2024). All "alignment" is extremely shallow, thus the general ease of jailbreaks.
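
The direction itself is estimated, roughly, as a difference of means between activations on harmful and harmless prompts. A minimal sketch of that step, assuming the activations have already been collected (shapes are illustrative):

    import torch

    def refusal_direction(harmful_acts, harmless_acts):
        # Both inputs: (num_prompts, d_model) residual-stream activations
        # collected at one layer/position via whatever hook setup you use.
        direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
        return direction / direction.norm()

    # Stand-in data just to show the call:
    r_hat = refusal_direction(torch.randn(128, 4096), torch.randn(128, 4096))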
| |
mwcz (a day ago):
Yes, I wasn't clear: that is the paper I was reading, not the Heretic readme.

andy99 (21 hours ago):
Ah, I didn't actually RTFA and see the paper was there; I assumed from your comment that it wasn't mentioned and posted it because I already knew about it :) Anyway, hopefully it was useful for someone.
| |
p-e-w (a day ago):
The alignment has certainly become stronger, though. Llama 3.1 is trivial to decensor with abliteration, and Heretic's optimizer rapidly converges to parameters that completely stamp out refusals, while for gpt-oss and Qwen3, most parameter configurations barely have an effect, and it takes much longer to reach something that even slightly lowers the refusal rate.
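
For context, a toy illustration of the kind of search being described (hypothetical refusal markers and parameter names; the real optimizer is considerably smarter than random sampling):

    import random

    REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

    def refusal_rate(generate, prompts):
        # Fraction of prompts whose completion looks like a refusal.
        replies = [generate(p).lower() for p in prompts]
        return sum(any(m in r for m in REFUSAL_MARKERS) for r in replies) / len(prompts)

    def random_search(make_generate, prompts, n_trials=50):
        # Sample ablation parameters at random; keep whichever refuses least.
        best_rate, best_params = 1.0, None
        for _ in range(n_trials):
            params = {"layer": random.randint(8, 31), "weight": random.uniform(0.5, 1.5)}
            rate = refusal_rate(make_generate(params), prompts)
            if rate < best_rate:
                best_rate, best_params = rate, params
        return best_rate, best_params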

shikon7 (a day ago):
It seems to me that thinking models are harder to decensor, as they are trained to reason about whether to accept your request.

int_19h (19 hours ago):
It goes both ways. E.g., an unmodified thinking Qwen is actually easier to jailbreak into talking about things like Tiananmen, by convincing it that it is unethical to refuse to do so.
|
|
|