Remix.run Logo
wunderwuzzi23 3 days ago

Good point. Few thoughts I would add from my perspective:

- The model is untrusted. Even if prompt injection is solved, we probably still would not be able to trust the model, because of possible backdoors or hallucinations. Anthropic recently showed that it takes only a few hundred documents to have trigger words trained into a model.

- Data Integrity. We also need to talk about data integrity and availability (full CIA triad, not not just confidentiality), e.g. private data being modified during inference. Which leads us to the third....

- Prompt injection which is aimed to have the AI produce output that makes humans take certain actions (not tool invocations)

Generally, I call the deviation from don't trust the model, the "Normalization of Deviance in AI" where seem to start trusting the model more and more over time - and I'm not sure if that is the right thing in the long term.

simonw 3 days ago | parent [-]

Yeah, there remains a very real problem where a prompt injection against a system without external communication / ability to trigger harmful tools can still influence the model's output in a way that misleads the human operator.