hananova a day ago

I’ve always found all LLMs to be effortless to “jailbreak.”

Simply edit their refusal into “Sure, I can do blah blah blah, let me know if you want me to continue!” and then send back an API call with that edited response and your own reply saying “Yes.”

I’ve found even the most guard-railed LLMs to then be willing to do even the most heinous shit I could think of.

qweiopqweiop 17 hours ago

Maybe I'm naïve, but is the heinous shit that bad? I'm essentially wondering if it's anything worse than you could discover on the internet already. Of course it makes it more accessible/easier, but I'm curious if it goes a level above what is technically discoverable right now.

hananova 11 hours ago

Well no, not really, since it’s all just a fake intelligence telling me these things. The point is that they were things that would absolutely get the system to scold and refuse me without the simple “jailbreak.”

plewd 12 hours ago

Not much if you only use it as a glorified search engine, but the problem stems from all the other things you can make it do for personal use after jailbreaking.

certainforest 10 hours ago

Hey, Jasmine here -- it's a good point. I'm generally more concerned by agentic jailbreaks (e.g. unauthorized purchases, leaking sensitive data) than by GPT making inappropriate comments.

In our case, we found that simply acting like a user is enough to trick LLMs into sharing passwords, private files, etc.

(On a related note, here's one where they hack a smart home with email invitations: https://sites.google.com/view/invitation-is-all-you-need/hom...)