▲ | simonw 4 days ago | |||||||||||||||||||||||||
"But we also need guarantees at other layers, like distinguishing web contents from user instructions" How do you intend to do that? In the three years I've spent researching and writing about prompt injection attacks I haven't seen a single credible technique from anyone that can distinguish content from instructions. If you can solve that you'll have solved the entire class of prompt injection attacks! | ||||||||||||||||||||||||||
▲ | skaul 4 days ago | parent | next [-] | |||||||||||||||||||||||||
> I haven't seen a single credible technique from anyone that can distinguish content from instructions You specifically mean that it's ~impossible to distinguish between content and instructions ONCE it is fed to the model, right? I agree with that. I was talking about a prior step, at the browser level. At the point that the query is sent to the backend, the browser would be able to distinguish between web contents and user prompt. This is useful for checking user-alignment of the output of the reasoning model (keeping in mind that the moment you feed in untrusted text into a model all bets are off). We're actively thinking and working on this, so will have more to announce soon, but this discussion is useful! | ||||||||||||||||||||||||||
| ||||||||||||||||||||||||||
▲ | hiatus 4 days ago | parent | prev [-] | |||||||||||||||||||||||||
Operating systems solved this with "mark of the web". Distinguishing data from instructions seems to be only part of the problem (and the easier one—presumably tools could label data downloaded from external sources accordingly at runtime). The harder problem seems to be blocking execution of instructions in data while still being able to use the data to generate a response. |