ndr_ 3 hours ago

These prompts chain several known LM exploits together. I ran experiments against gpt-oss-20b and it became clear that the effectiveness didn't come from the gay factor at all but can be attributed to language choice or role-play.

Technical report: https://arxiv.org/abs/2510.01259

jasonfarnon an hour ago | parent | next [-]

"can be attributed to language choice or role-play."

Well, what role? I imagine if the role is "drug dealer" it doesn't work so it can't be "role-play" per se. Does it work with "nazi"? Are you suggesting the roles it works with are politically neutral?

Terr_ 2 hours ago | parent | prev [-]

When someone is blaming the jail-break phenomenon on "political overcorrectness" (versus the other techniques being used) I get a little suspicious about the author's own bias/agenda.

xp84 an hour ago | parent [-]

Are we pretending that LLMs aren't pathologically aligned toward political correctness? It's pretty easy to test that assertion if you don't believe me.

cwillu an hour ago | parent [-]

Are we pretending that the gp wasn't exactly the sort of test you suggest?