▲ | kouteiheika a day ago | |
> The presence of multiple and repeatable universal bypasses means that attackers will no longer need complex knowledge to create attacks or have to adjust attacks for each specific model ...right, now we're calling users who want to bypass a chatbot's censorship mechanisms as "attackers". And pray do tell, who are they "attacking" exactly? Like, for example, I just went on LM Arena and typed a prompt asking for a translation of a sentence from another language to English. The language used in that sentence was somewhat coarse, but it wasn't anything special. I wouldn't be surprised to find a very similar sentence as a piece of dialogue in any random fiction book for adults which contains violence. And what did I get? https://i.imgur.com/oj0PKkT.png Yep, it got blocked, definitely makes sense, if I saw what that sentence means in English it'd definitely be unsafe. Fortunately my "attack" was thwarted by all of the "safety" mechanisms. Unfortunately I tried again and an "unsafe" open-weights Qwen QwQ model agreed to translate it for me, without refusing and without patronizing me how much of a bad boy I am for wanting it translated. |