| ▲ | vidarh 2 hours ago | |||||||
This makes no sense to me. Being fooled into thinking data is instruction is exactly evidence of an inability to reliably distinguish them. And being coerced or convinced to bypass rules is exactly what prompt injection is, and very much not uniquely human any more. | ||||||||
| ▲ | kg 2 hours ago | parent [-] | |||||||
The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works. Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss. Many defense mechanisms used in corporate email environments are built around making sure the email from your boss looks meaningfully different in order to establish that data vs instruction separation. (There are social engineering attacks that would work in-person though, but I don't think it's right to equate those to LLM attacks.) Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything. | ||||||||
| ||||||||