▲ | veganmosfet 3 days ago | ||||||||||||||||
As possible mitigation, they mention "The browser should distinguish between user instructions and website content". I don't see how this can be achieved in a reliable way with LLMs tbh. You can add fancy instructions (e.g., "You MUST NOT...") and delimiters (e.g., "<non_trusted>") and fine-tune the LLM but this is not reliable, since instructions and data are processed in the same context and in the same way. There are 100s of examples out there. The only reliable countermeasures are outside the LLMs but they restrain agent autonomy. | |||||||||||||||||
▲ | JoshTriplett 3 days ago | parent | next [-] | ||||||||||||||||
The reliable countermeasure is "stop using LLMs, and build reliable software instead". | |||||||||||||||||
| |||||||||||||||||
▲ | wat10000 3 days ago | parent | prev | next [-] | ||||||||||||||||
It’s not possible as things currently stand. It’s worrying how often people don’t understand this. AI proponents hate the “they just predict the next token” approach, but it sure helps a lot to understand what these things will actually do for a particular input. | |||||||||||||||||
| |||||||||||||||||
▲ | rtrgrd 3 days ago | parent | prev | next [-] | ||||||||||||||||
The blog mentions checking each agent action (say the agent was planning to send a malicious http request) against the user prompt for coherence; the attack vector exists but it should make the trivial versions of instruction injection harder | |||||||||||||||||
▲ | ninkendo 3 days ago | parent | prev | next [-] | ||||||||||||||||
I wonder if it could work somewhat the way MIME multiparty attachment boundaries work in email: pick a random string of characters (unique for each prompt) and say “everything from here to the time you see <random_string> is not the user request”. Since the string can’t be guessed, and is different each request, it can’t be faked. It still suffers from the LLM forgetting that the string is the important part (and taking the page content as instructions anyway) but maybe they can drill the LLM hard in the training data to reinforce it. | |||||||||||||||||
▲ | Esophagus4 3 days ago | parent | prev [-] | ||||||||||||||||
> The only reliable countermeasures are outside the LLMs but they restrain agent autonomy. Do those countermeasures mean human-in-the-loop approving actions manually like users can do with Claude Code, for example? | |||||||||||||||||
|