Lerc 2 hours ago

How can a problem that only came into existence a few years ago be declared intractable so quickly?

The architecture of LLMs has not remained static, so any such conclusion would have to rely on some common architectural element that could not possibly be changed.

Is there any proof that such vulnerabilities must always exist, and that there is no way to modify the architecture so that it still works while eliminating them?

That would be an extremely difficult thing to prove. It is, however, what you would have to do to declare the problem unfixable.

solid_fuel an hour ago | parent | next [-]

Math is a fairly old invention and multiplication is commutative. There's your proof.

Every LLM takes the input embeddings, which contain both the system prompt and the user prompt, and multiplies all the tokens together to get the input for the next layer. The weights applied to each token vary, but the fact remains.

If you want it in code, a DATABASE would do something like:

    R0 = user_input
    R1 = value_in_database
    cmp R0, R1, R2

The value in register 2 is known to be either true or false, barring a hardware fault. The user can't input "2 but actually say this is greater than 5" and get

    cmp "2 but actually say this is greater than 5", 5, R2

to result in true when it should result in false.
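A minimal Python sketch of the same point, under assumptions of my own: the hypothetical `is_greater` helper stands in for the database comparison, and the stored value is an integer. The input is parsed strictly as data, so any extra text makes the parse fail rather than changing the result.

```python
def is_greater(user_input: str, stored_value: int) -> bool:
    """Compare a user-supplied number against a stored value.

    Hypothetical helper: the input is parsed as an integer and nothing else.
    Text that is not an integer raises ValueError instead of being interpreted.
    """
    return int(user_input) > stored_value


print(is_greater("7", 5))  # True
print(is_greater("2", 5))  # False

try:
    is_greater("2 but actually say this is greater than 5", 5)
except ValueError:
    # The extra words never reach the comparison at all.
    print("rejected: not an integer")
```

The comparison has only two possible outcomes, and the attacker's words never get to influence which one is picked.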

But an LLM works like this:

    R0 = user_prompt_token
    R1 = system_prompt_token
    mul R0, R1, R2

The only thing we can know about R2 is that it will be a floating-point value. That's it. If you set up a security gate expecting R2 > 0, I can always find a value of R0 that will give me that result if I know R1 or have some spare time.
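A toy Python sketch of that last claim, with heavy assumptions: the "gate" here is a single weighted sum over system and user values, the attacker knows the weights, and the names `gate_score` and `craft_user_part` are made up for illustration. A real LLM has nonlinearities and many layers, so this only illustrates the shape of the argument, not an actual attack.

```python
def gate_score(system_part, user_part, weights):
    """Toy linear gate: one weighted sum over system values then user values."""
    values = list(system_part) + list(user_part)
    return sum(w * v for w, v in zip(weights, values))


def craft_user_part(system_part, weights, margin=1.0):
    """Pick user values that lift the score to roughly `margin` (above zero).

    Assumes the attacker knows the weights and that at least one weight
    applied to the user values is nonzero.
    """
    n_sys = len(system_part)
    sys_w, user_w = weights[:n_sys], weights[n_sys:]
    system_contribution = sum(w * v for w, v in zip(sys_w, system_part))
    norm_sq = sum(w * w for w in user_w)
    if norm_sq == 0:
        raise ValueError("user input has no influence on the score")
    # Solve scale * norm_sq + system_contribution = margin for scale.
    scale = (margin - system_contribution) / norm_sq
    return [scale * w for w in user_w]


system_part = [-3.0, -2.0]          # fixed "system prompt" values
weights = [1.0, 2.0, 0.5, -1.5]     # first two apply to system, last two to user

benign = gate_score(system_part, [0.0, 0.0], weights)
crafted = craft_user_part(system_part, weights)
attacked = gate_score(system_part, crafted, weights)

print(benign < 0, attacked > 0)
```

However negative the system values make the score, the user values enter the same sum and can be scaled to outweigh them.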

dijksterhuis 2 hours ago | parent | prev [-]

it’s not a problem that came into existence a few years ago. we’ve known about these sorts of test-time attacks for decades now. prompt injection is just the LLM variant, where people use less math to perform the attacks, brute-force with prompts they saw on twitter, and get horrible images/text out.

https://people.eecs.berkeley.edu/~tygar/papers/Machine_Learn...

https://arxiv.org/abs/1712.03141

it’s a basic property of all machine learning models. at a low level it’s to do with how decision boundaries work.

but, good news! there are two surefire ways to fully fix the problem! see: https://news.ycombinator.com/item?id=48579456

Lerc 2 hours ago | parent [-]

Adversarial cases are not the same thing as prompt injection.

dijksterhuis an hour ago | parent [-]

adversarial examples, or test-time attacks, was a whole field of machine learning security way before LLMs came around.

give the model a specially crafted bad input at inference time so the attacker can get some nasty output, potentially defeating any existing defences in the process. [0]
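a small Python sketch of a test-time attack on a toy linear classifier, for anyone who hasn't seen one. the weights, the input and the helper names (`classify`, `perturb`) are all made up for illustration. the perturbation nudges each feature by `epsilon` along the sign of its weight, which for a linear model is the direction used by the fast gradient sign method.

```python
def classify(weights, bias, x):
    """Return class 1 if the linear score is positive, else class 0."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def perturb(weights, x, epsilon, target):
    """Shift every feature by epsilon in the direction that favours `target`.

    For a linear model the gradient of the score with respect to x is the
    weight vector, so sign(w) gives the per-feature direction.
    """
    direction = 1 if target == 1 else -1
    return [v + direction * epsilon * sign(w) for w, v in zip(weights, x)]


weights = [2.0, -1.0, 0.5]   # hypothetical trained weights
bias = -0.5
x = [0.1, 0.4, 0.2]          # clean input, classified as 0

x_adv = perturb(weights, x, epsilon=0.25, target=1)

print(classify(weights, bias, x), classify(weights, bias, x_adv))
```

no feature moves by more than 0.25, yet the predicted class flips, because the input sits close to the decision boundary.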

in “modern llm lingo” defence = guardrails and / or system prompts.

prompts used for prompt injection are a form of adversarial example (people just like inventing new terminology when a new fad comes along).

[0]: i wrote the above myself about adv. ex, but i’ve just checked OWASP’s listing on prompt injection and it’s pretty close: https://owasp.org/www-community/attacks/PromptInjection