simonw 3 days ago

I added this section to my post just now: https://simonwillison.net/2025/Nov/2/new-prompt-injection-pa...

> On thinking about this further there’s one aspect of the Rule of Two model that doesn’t work for me: the Venn diagram above marks the combination of untrustworthy inputs and the ability to change state as “safe”, but that’s not right. Even without access to private systems or sensitive data that pairing can still produce harmful results. Unfortunately adding an exception for that pair undermines the simplicity of the “Rule of Two” framing!
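
To make that concrete, here's a minimal sketch (my own illustration, not code from the paper or my post; the capability names are made up) of a Rule of Two style gate with the extra check for the untrusted-input plus state-change pair:

    # Three capabilities roughly matching the paper's circles; names are illustrative.
    UNTRUSTED_INPUT = "untrusted_input"      # [A] processes untrustworthy content
    SENSITIVE_ACCESS = "sensitive_access"    # [B] private data / sensitive systems
    CHANGES_STATE = "changes_state"          # [C] changes state / communicates externally

    def assess(capabilities: set[str]) -> str:
        if len(capabilities) == 3:
            return "block: all three properties combined"
        # The extra case argued for above: even without [B], untrusted input
        # plus the ability to change state can still cause harm.
        if {UNTRUSTED_INPUT, CHANGES_STATE} <= capabilities:
            return "review: risky even without sensitive access"
        return "lower risk"

    print(assess({UNTRUSTED_INPUT, CHANGES_STATE}))  # review: risky even without sensitive access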

kloud 3 days ago | parent | next [-]

Also, in the context of LLMs, I think the model weights themselves could be considered an untrusted input, because who knows what was in the training dataset. Even an innocent-looking prompt could potentially trigger a harmful outcome.

In that regard it reminds me of the CAP theorem, which also has three parts. However, in practice partitioning in distributed systems is a given, so the choice is really just between availability and consistency.

So in the case of the lethal trifecta, the real choice is between private data and external communication, but the leg between those two will always carry some risk.

wunderwuzzi23 3 days ago | parent | prev | next [-]

Good point. A few thoughts I would add from my perspective:

- The model is untrusted. Even if prompt injection is solved, we probably still would not be able to trust the model, because of possible backdoors or hallucinations. Anthropic recently showed that it takes only a few hundred documents to have trigger words trained into a model.

- Data Integrity. We also need to talk about data integrity and availability (the full CIA triad, not just confidentiality), e.g. private data being modified during inference. Which leads us to the third point...

- Prompt injection that is aimed at having the AI produce output that makes humans take certain actions (rather than tool invocations)

Generally, I call this drift away from "don't trust the model" the "Normalization of Deviance in AI": we seem to start trusting the model more and more over time, and I'm not sure that is the right thing in the long term.

simonw 3 days ago | parent [-]

Yeah, there remains a very real problem where a prompt injection against a system without external communication / ability to trigger harmful tools can still influence the model's output in a way that misleads the human operator.

mickayz 3 days ago | parent | prev | next [-]

Thanks for the feedback! One small bit of clarification: the framework would describe access to any sensitive system as part of the [B] circle, not only private systems or private data.

The intention is that an agent that has removed [B] can write state and communicate freely, but not with any systems that matter (wrt critical security outcomes for its user). An example of an agent in this state would be one that can take actions in a tight sandbox or is isolated from production.
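
As a rough sketch of what removing [B] could look like in practice (the host names and helper below are hypothetical, not part of the framework):

    # Only sandboxed, non-sensitive targets are reachable; anything else is
    # treated as a sensitive system ([B]) and refused, even though the agent
    # is otherwise free to change state and communicate.
    SANDBOX_HOSTS = {"sandbox.internal.test", "scratch-bucket.test"}  # made-up names

    def allow_tool_call(target_host: str) -> bool:
        return target_host in SANDBOX_HOSTS

    print(allow_tool_call("sandbox.internal.test"))  # True
    print(allow_tool_call("prod-db.internal"))       # False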

simonw 3 days ago | parent [-]

Thanks for that! I've updated my post to link to this clarification and updated my screenshots of your diagram to catch the new "lower risk" text as well: https://simonwillison.net/2025/Nov/2/new-prompt-injection-pa...

causal 3 days ago | parent | prev | next [-]

I think the rule of 2 would work if it kept the 3 from your lethal trifecta. "Change state" should not be paired with "communicate externally".

And even then that's just to avoid data exfiltration: if you can't communicate externally but can change state, damage can still be done.
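
A tiny sketch of that stricter variant (my own illustration, with made-up capability names): keep the trifecta blocked as a whole, and additionally never allow "change state" together with "communicate externally":

    TRIFECTA = frozenset({"private_data", "untrusted_content", "external_comms"})
    FORBIDDEN_PAIRS = {frozenset({"change_state", "external_comms"})}

    def allowed(capabilities: frozenset[str]) -> bool:
        if TRIFECTA <= capabilities:
            return False  # the full lethal trifecta
        # e.g. state changes plus external communication is refused outright
        return not any(pair <= capabilities for pair in FORBIDDEN_PAIRS)

    print(allowed(frozenset({"change_state", "external_comms"})))  # False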

ArcHound 3 days ago | parent | prev [-]

I love to see this. As much as we try for simple security principles, the damn things have a way of becoming complicated quickly.

Perhaps the diagram highlights the common risky parts of these apps, and we take on more risk as we keep increasing the scope? Maybe we could use handovers and protocols to separate these concerns?
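
For example, one way such a handover could look (the pattern, schema and names below are all hypothetical): a quarantined step reads the untrusted content but has no tools, and only a narrow, validated structure crosses over to a privileged step that can act:

    from dataclasses import dataclass

    @dataclass
    class TriageResult:
        # Only this narrow, validated structure crosses the boundary.
        ticket_id: str
        priority: int  # 1-5

    def quarantined_triage(untrusted_text: str) -> TriageResult:
        # Imagine an LLM call here; its free-form output is parsed and
        # validated into a TriageResult, never forwarded verbatim.
        return TriageResult(ticket_id="T-123", priority=3)

    def privileged_executor(result: TriageResult) -> None:
        # This side can change state, but only through the fixed schema above.
        print(f"Setting {result.ticket_id} to priority {result.priority}")

    privileged_executor(quarantined_triage("...attacker-controlled text..."))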