| ▲ | akersten 5 hours ago |
| 2024 which is ancient history. This is not true anymore, the models now are trained to prevent abliteration by spreading out the refusal encoding See https://arxiv.org/abs/2505.19056 |
|
| ▲ | cgearhart an hour ago | parent | next [-] |
| Spreading out the refusal encoding shouldn’t be effective as a countermeasure. Even if it were smeared across the vector space, as long as it’s in a subspace that doesn’t span the entire domain then you should be able to either null out the entire subspace spanned by the refusals or run some kind of clustering on the generated samples to identify the dominant directions and nullify all of them. I think an effective defense would either need to spread them to span the entire domain—basically “encrypting” the refusal so it can hide anywhere, or you’d need a very large number of independent refusal circuits in the model so that simple hacks in the vectors themselves don’t matter, or maybe you could make other circuits depend on proper functioning of the refusal circuits… hmmm… is that along the lines of what you’re saying they’ve done already? (Any references or links to modern techniques?) |
|
| ▲ | 0xkvyb 2 hours ago | parent | prev | next [-] |
| Still crazy how easy it is to "jailbreak" even SOTA LLMs with a simple assistantResponse replacement in chat thread. |
| |
|
| ▲ | Der_Einzige 4 hours ago | parent | prev [-] |
| That doesn't stop/prevent abliteration. The creator of XTC/DRY is also a chad who makes sure that you really can access the full model capabilities. Censorship is the devil. https://github.com/p-e-w/heretic |
| |
| ▲ | RRRA 4 hours ago | parent | next [-] | | It was pretty funny to see Qwen 3.6 (heretic) tell me about how many death the Chinese government thought happened at Tiananmen Sq. on April 15th 1989. Makes you wonder where that data was taken from, or if their great firewall is broken, or even if Alibaba engineers have special access... | | |
| ▲ | tonyarkles an hour ago | parent | next [-] | | I think I was using one of the HuaHuaCS Qwen 3.6 models and was playing around with Tiananmen Square questions too. One of the funniest parts was that this instantly caused the thinking block to change from English to Chinese. The start of the thinking was something like (translated) “I must answer this question factually and in line with the official statements from the Chinese government.” It did, after a few follow up prompts, point out that the original estimates published by the Chinese government were much lower than what the west had estimated, and that recently declassified documents showed that the Chinese government knew that their estimates were low when they were published. It wouldn’t come outright and use the word “lie” though, but it did talk about framing and managing different narratives. And then it happily helped me try a bunch of different exploits to root an unpatched Linux machine without any qualms. | |
| ▲ | arcfour 4 hours ago | parent | prev | next [-] | | I don't think it's unreasonable to imagine that Alibaba is allowed to scrape the wider internet, or that some research institution is and then Alibaba got data from them. What is perhaps more surprising is that the data was not scrubbed before training, but maybe they thought that would be too on-the-nose for the rest of the world and would hamper their popularity if they were too obviously biased. | | |
| ▲ | orbital-decay 3 hours ago | parent | next [-] | | Allowed by who? Nobody's stopping them in the first place, as scraping doesn't even involve punching the GFW or anything, it's all insanely distributed. Then they're post-training the model to technically comply with the law - "Taiwan is an inalienable part of China, nothing has happened in 1989..." yada yada. (Thinking of it more, I've never actually tried this on their base models) | |
| ▲ | freehorse 4 hours ago | parent | prev [-] | | I don’t think it is very surprising. Ime I don’t think they try that hard to censor them, but only in a very superficial level that they have to. It is trivial to get their models tell you this kind of stuff, I wouldnt even consider it jailbreaking. |
| |
| ▲ | SoKamil 3 hours ago | parent | prev [-] | | No wonder this data is in LibGen. |
| |
| ▲ | adrian_b 3 hours ago | parent | prev | next [-] | | It is an arms race. For some of the latest models the previous abliteration techniques, e.g. the heretic tool, have stopped working (at least this was the status a few weeks ago). Of course, eventually someone might succeed to find methods that also work with those. | | | |
| ▲ | akersten 3 hours ago | parent | prev [-] | | Agreed on all fronts, I should have been more precise that this particular vector was mitigated |
|