Refusal in Language Models Is Mediated by a Single Direction

hleszek 9 minutes ago | parent | next [-]

For open-weights models, censorship removal is now a "solved" problem. If you wait a few days after a new model release, someone will have made a heretic ( https://github.com/p-e-w/heretic ) version with the censorship removed, so in a way the only use for censorship now is to avoid lawsuits, not reduce improper usage.

	▲	jakkos 3 minutes ago \| parent [-]
		Any time I've tried an "abliterated" model, heretic or other, it has always damaged the capabilities of the original model and will still often refuse or produce garbage at a lot of "unsafe" requests.

▲

akersten 4 hours ago | parent | prev | next [-]

2024 which is ancient history. This is not true anymore, the models now are trained to prevent abliteration by spreading out the refusal encoding

See https://arxiv.org/abs/2505.19056

▲

0xkvyb an hour ago | parent | next [-]

Still crazy how easy it is to "jailbreak" even SOTA LLMs with a simple assistantResponse replacement in chat thread.

▲

dotancohen 17 minutes ago | parent [-]

Tell us more.

	▲	_3u10 14 minutes ago \| parent [-]
		I think what he is saying is they are stateless so you can edit its previous repsonses and it just goes with it.

▲

Der_Einzige 3 hours ago | parent | prev [-]

That doesn't stop/prevent abliteration. The creator of XTC/DRY is also a chad who makes sure that you really can access the full model capabilities. Censorship is the devil.

https://github.com/p-e-w/heretic

▲

adrian_b an hour ago | parent | next [-]

It is an arms race.

For some of the latest models the previous abliteration techniques, e.g. the heretic tool, have stopped working (at least this was the status a few weeks ago).

Of course, eventually someone might succeed to find methods that also work with those.

	▲	Der_Einzige an hour ago \| parent [-]
		Proof?

▲

RRRA 3 hours ago | parent | prev | next [-]

It was pretty funny to see Qwen 3.6 (heretic) tell me about how many death the Chinese government thought happened at Tiananmen Sq. on April 15th 1989.

Makes you wonder where that data was taken from, or if their great firewall is broken, or even if Alibaba engineers have special access...

▲

arcfour 3 hours ago | parent | next [-]

I don't think it's unreasonable to imagine that Alibaba is allowed to scrape the wider internet, or that some research institution is and then Alibaba got data from them.

What is perhaps more surprising is that the data was not scrubbed before training, but maybe they thought that would be too on-the-nose for the rest of the world and would hamper their popularity if they were too obviously biased.

	▲	orbital-decay an hour ago \| parent \| next [-]
		Allowed by who? Nobody's stopping them in the first place, as scraping doesn't even involve punching the GFW or anything, it's all insanely distributed. Then they're post-training the model to technically comply with the law - "Taiwan is an inalienable part of China, nothing has happened in 1989..." yada yada. (Thinking of it more, I've never actually tried this on their base models)
	▲	freehorse 2 hours ago \| parent \| prev [-]
		I don’t think it is very surprising. Ime I don’t think they try that hard to censor them, but only in a very superficial level that they have to. It is trivial to get their models tell you this kind of stuff, I wouldnt even consider it jailbreaking.

▲

SoKamil 2 hours ago | parent | prev [-]

No wonder this data is in LibGen.

▲

akersten an hour ago | parent | prev [-]

Agreed on all fronts, I should have been more precise that this particular vector was mitigated

▲

beaker52 an hour ago | parent | prev [-]

I have had LLMs refuse several of my requests. I still got my answers, but at least they tried.

▲

NewsaHackO an hour ago | parent [-]

Yea, I was asking a SOTM about copy.fail, and it was freaking out, and tried to indirectly call me a hacker a few times. Weirdly, all I did was slightly reword requests, and they all went through. Granted, I am not actually a hacker, so I guess my follow-up questions made it realize that I am asking for educational purposes, but it was definitely the most accusatory, curt, and outright abrasive I have seen an LLM behave.

	▲	whynotmaybe 43 minutes ago \| parent [-]
		I've been able to have deepseek give me an unofficial account of what happened on Tiananmen square in 1989. It even went as far as confirming that we should always base our opinion on multiple sources, not just the government. We should create badges like "script kiddie", "llm hacker", "grandpa's printer adjuster"