Remix.run Logo
tempaccsoz5 2 hours ago

Even if you don't fully retrain, you could get what's likely a pretty good safety improvement. Honestly, I'm a bit surprised the main AI labs aren't doing this

You could just include an extra single bit with each token that represents trusted or untrusted. Add an extra RL pass to enforce it.