estimator7292 12 hours ago

Obviously, they already tried.

The problem is that there simply is no way to do this reliably. The models are all stochastic processes, and the only real levers model designers have to pull involve asking the model to pretty please not do something bad.
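To make the "stochastic process" point concrete, here is a toy sketch in plain Python. Everything in it is made up for illustration: the three candidate outputs, their scores, and the `PENALTY` value are hypothetical and do not come from any real model. It also assumes that a "please don't" instruction behaves roughly like a penalty on the unwanted output's score, which is a simplification.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for three possible outputs (not from a real model).
base_logits = {"helpful": 2.0, "neutral": 1.0, "harmful": 0.5}

# Assumption: the instruction acts like a score penalty on the bad output.
PENALTY = 6.0
steered_logits = dict(base_logits, harmful=base_logits["harmful"] - PENALTY)

p_base = softmax(base_logits)["harmful"]
p_steered = softmax(steered_logits)["harmful"]

def chance_of_at_least_one(p, n):
    """Probability that an event with per-draw chance p occurs in n draws."""
    return 1.0 - (1.0 - p) ** n

print(f"per-sample chance before penalty: {p_base:.6f}")
print(f"per-sample chance after penalty:  {p_steered:.6f}")
print(f"chance of at least one in 1,000,000 samples: "
      f"{chance_of_at_least_one(p_steered, 1_000_000):.6f}")
```

In this toy setup the penalty makes the unwanted output much rarer per sample, but its probability never reaches zero, so across enough samples it is still expected to show up.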

And then it turns out that it's pretty easy to also ask models to pretty please ignore previous instructions. You can also accidentally get a model into a state where it ignores system prompt guidelines.

There is no big #ifdef DONT_TELL_USER_TO_DIE switch in the code. Nobody truly understands how the models work under the hood, and there simply is no way to guarantee 100% that a model will never do something.