causal 6 days ago

I don't think we should have to choose between "sycophantic coddling" and "alert the authorities". Surely there's a middle ground where the model can point the user to help and then refuse to participate further.

Of course jailbreaking via things like roleplay might still be possible, but at that point I don't really blame the model if the user is engineering the outcome.

lawlessone 6 days ago | parent [-]

Maybe add a simple tool for it to call, to notify a human who can determine whether there is an issue.

myvoiceismypass 6 days ago | parent [-]

We cannot even successfully prevent swatting here in the States, and that process is full of human involvement.