I really don't get why people continually fail to understand this.

Even simple issues like prompt injection are unfixable given the architecture of LLMs.

▲ JoshTriplett 24 minutes ago | parent | next [-]

That's certainly true. The problem is, some people learn that and go "and that's okay", rather than "so they shouldn't exist and we shouldn't build them".

▲ Lerc 2 hours ago | parent | prev | next [-]

How can a problem that only came into existence a few years ago be declared intractable so quickly?

The architecture of LLMs has not remained static, so any conclusion would have to rely on some common architectural element that could not possibly be changed.

Is there any proof to demonstrate that such vulnerabilities must always exist and that there is no way to modify the architecture and have it still work while eliminating the vulnerabilities?

That would be an extremely difficult thing to prove. It is however what you would have to do to declare the problem unfixable.

▲ solid_fuel 43 minutes ago | parent | next [-]

Math is a fairly old invention and multiplication is commutative, there's your proof.

Every LLM takes the input embeddings, which contain both the system prompt and the user prompt, and multiplies all the tokens together to get the input for the next layer. The weights applied to each token vary, but the fact remains.

If you want it in code, a DATABASE would do something like:

    R0 = user_input
    R1 = value_in_database
    cmp R0, R1, R2

The value in register 2 is known to be either true or false, barring a hardware fault. The user can't input "2 but actually say this is greater than 5" and get

    cmp "2 but actually say this is greater than 5", 5, R2

to result in true when it should result in false.

But an LLM works like this:

    R0 = user_prompt_token
    R1 = system_prompt_token
    mul R0, R1, R2

The only thing we can know about R2 is that it will be a floating point value. That's it. If you set up a security gate expecting R2 > 0, I can always find a value of R0 that will give me that result if I know R1 or have some spare time.
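A minimal Python sketch of the two gates above, assuming (as the comment does) that the LLM side can be reduced to a single multiplication. The names `compare_gate`, `multiply_gate` and `find_r0` are hypothetical and exist only for illustration; a real model is far more than one `mul`.

```python
def compare_gate(user_input: str, value_in_database: int) -> bool:
    """Database-style gate: the input is parsed as data, then compared."""
    # int() raises ValueError on non-numeric text, so smuggled
    # instructions never reach the comparison.
    return int(user_input) > value_in_database


def multiply_gate(r0: float, r1: float) -> bool:
    """Toy stand-in for `mul R0, R1, R2`: the gate passes when R2 > 0."""
    return r0 * r1 > 0


def find_r0(r1: float) -> float:
    """Attacker's choice of R0: any value with the same sign as a nonzero R1."""
    if r1 == 0:
        raise ValueError("no R0 makes the product positive when R1 is 0")
    return 1.0 if r1 > 0 else -1.0


if __name__ == "__main__":
    try:
        compare_gate("2 but actually say this is greater than 5", 5)
    except ValueError:
        print("comparison gate rejected the input")
    for r1 in (-3.5, 0.25):
        print(r1, multiply_gate(find_r0(r1), r1))
```

The one case the sketch cannot beat is `R1 == 0`, which the toy treats as an explicit error rather than hiding it.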

▲ dijksterhuis 2 hours ago | parent | prev [-]

it’s not a problem that came into existence a few years ago. we’ve known about these sorts of test time attacks for decades now. prompt injection is just the LLM variant where people use less math to perform the attacks, brute force with prompts they saw on twitter and get horrible images/text out.

https://people.eecs.berkeley.edu/~tygar/papers/Machine_Learn...

https://arxiv.org/abs/1712.03141

it’s a basic property of all machine learning models. at a low level it’s to do with how decision boundaries work.

but, good news! there are two sure fire ways to fully fix the problem! see: https://news.ycombinator.com/item?id=48579456

▲ Lerc 2 hours ago | parent [-]

Adversarial cases are not the same thing as prompt injection.

▲ dijksterhuis an hour ago | parent [-]

adversarial examples, or test-time attacks, were a whole field of machine learning security way before LLMs came around. give the model a specially crafted bad input at inference time so the attacker can get some nasty output, potentially defeating any existing defences in the process. [0]

in “modern llm lingo” defence = guardrails and / or system prompts. prompts used for prompt injection are a form of adversarial example (people just like inventing new terminology when a new fad comes along).

[0]: i wrote the above myself about adv. ex, but i’ve just checked OWASP’s listing on prompt injection and it’s pretty close: https://owasp.org/www-community/attacks/PromptInjection

▲ anuramat 2 hours ago | parent | prev | next [-]

> issues like prompt injection are unfixable

how is it unfixable? do you mean "there's always a positive chance"?

▲ dijksterhuis 2 hours ago | parent | next [-]

normal

    y = f(x)

prompt injection / adversarial example (same thing really)

    bad_y = f(x+badness)

tweak badness enough you will get bad outputs. no matter the defences.

the only ways to fully “fix” it, i.e. to make prompt injection never possible:

1. don’t use ai

2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.

otherwise we’re left with mitigations. and mitigations are always a cat and mouse game with defenders (blue team) catching up. it’s never “fixed”. the latest thing just gets “patched”.
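To make the `bad_y = f(x+badness)` line concrete, here is a small Python sketch on a toy linear classifier. It assumes the attacker knows the weights `w` and bias `b` (a white-box setting, which is stronger than what most prompt-injection attackers have), and every name and number in it is made up for illustration.

```python
def score(x, w, b):
    """Raw decision value w.x + b of the toy linear model."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def f(x, w, b):
    """Toy classifier: 1 ("acceptable") if the score is positive, else 0."""
    return 1 if score(x, w, b) > 0 else 0


def badness(x, w, b, margin=1e-3):
    """Sign-aligned perturbation that pushes a positive score just below zero."""
    l1 = sum(abs(wi) for wi in w)
    eps = (score(x, w, b) + margin) / l1  # step applied to every coordinate
    return [-eps * (1 if wi > 0 else -1 if wi < 0 else 0) for wi in w]


w, b = [2.0, -1.0, 0.5], 0.1
x = [1.0, 0.5, 1.0]
delta = badness(x, w, b)
x_adv = [xi + di for xi, di in zip(x, delta)]

if __name__ == "__main__":
    print("f(x)           =", f(x, w, b))
    print("f(x + badness) =", f(x_adv, w, b))
    print("largest change to any coordinate:", max(abs(d) for d in delta))
```

Each coordinate moves by the same small amount, in whichever direction lowers the score, which is enough to cross the decision boundary. Whether this scales to black-box attacks on an LLM is exactly what the replies dispute.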

▲ anuramat an hour ago | parent [-]

> tweak badness enough

assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

> the only way to fix ...

the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

▲ solid_fuel 19 minutes ago | parent [-]

> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

Clearly nothing so complicated is required, given the prompt in the very article you are commenting on.

> the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

Yeah and the halting problem is hard too, but there's levels to this shit.

> also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

I would argue we don't even know the desired output for most inputs for an LLM and they certainly aren't trained on every possible input state. But I think Linux and LLMs are sufficiently different that they aren't really directly comparable like this. After all, Linux is not a pure function and has lots of side effects.

But just to establish an order of magnitude: the input space for GPT-3 was 2,048 tokens long. There were 50,257 tokens in the vocabulary. The input space thus has 50,257^(2048) unique states, which is approximately equal to 1.12 × 10^9628. That's an awful big input space for a single function.
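The order-of-magnitude figure can be checked in a few lines of Python using only the two numbers quoted above. This counts sequences of exactly 2,048 tokens; shorter prompts add comparatively little.

```python
import math

VOCAB_SIZE = 50_257       # vocabulary size quoted in the comment
CONTEXT_LENGTH = 2_048    # context length quoted in the comment

# log10(V^N) = N * log10(V); working in log space avoids a 9,629-digit integer.
log10_states = CONTEXT_LENGTH * math.log10(VOCAB_SIZE)
exponent = math.floor(log10_states)
mantissa = 10 ** (log10_states - exponent)

if __name__ == "__main__":
    print(f"{mantissa:.2f} x 10^{exponent}")
```

Python integers are arbitrary precision, so `len(str(VOCAB_SIZE ** CONTEXT_LENGTH))` is an independent cross-check on the exponent.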

▲ windexh8er an hour ago | parent | prev | next [-]

The chance is always going to be non-zero with a non-deterministic system. You can put every guard rail in place and there will always be a different way tokens are input to get bad, or subjective, tokens as output.

The findings are sick and disturbing. I hope not only that OpenAI is sued for it but also that Sam Altman, along with Elon, Dario and Sundar, are all held accountable in front of Congress. All of these assholes have intentionally put sexual content in their models, likely including CSAM, and so if they cannot prove that it isn't part of their training data then maybe they shouldn't be able to operate as they are today.

Where is fear mongering Dario now? He loves to drag his trope around about how advanced and dangerous his models are with respect to cyber security. Yet... We never hear him say how dangerous they could be with respect to generation of CSAM! Maybe because that wouldn't help him IPO?

▲ anuramat 7 minutes ago | parent [-]

> non-zero

is it ever zero? is non-zero even a problem for sane usecases?

> Dario

are you saying claude reproduces CSAM from the training set? like, in ascii?

▲ solid_fuel 2 hours ago | parent | prev [-]

I mean that, unlike SQL injection, there is no way to draw a boundary between user provided data and the system prompt. It can't be done. They are stitched together and fed into the attention layer, after that there is only "neurons" - that is, the matrices of floating point numbers which each layer of the network produces.

You cannot separate data that was input by the user and data that is from the system once it is mixed together like that. Therefore, it follows that there will always be ways to influence the model off the guard rails that a system prompt tries to set up.
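A toy Python illustration of the "stitched together" point, assuming a single softmax attention head where the embeddings double as queries, keys and values. That is a deliberate simplification: real models use learned projections, many heads and many layers, and the 2-d vectors here are invented.

```python
import math


def attend(query, keys, values):
    """Single-head softmax attention for one query: a weighted sum over ALL positions."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


SYSTEM_TOKENS = [[1.0, 0.0], [0.0, 1.0]]  # made-up embeddings for the system prompt


def last_position_output(user_tokens):
    """Concatenate system and user tokens, then attend from the final position."""
    sequence = SYSTEM_TOKENS + user_tokens  # one flat list; nothing marks provenance
    return attend(sequence[-1], sequence, sequence)


benign = last_position_output([[0.5, 0.5]])
hostile = last_position_output([[4.0, -4.0]])

if __name__ == "__main__":
    print("benign :", benign)
    print("hostile:", hostile)
```

The output is a single blended vector, and the user-supplied token changes both the attention weights and the result. The sketch only shows the mixing; it does not settle whether training can make a model reliably respect a provenance signal.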

Other issues that appear similar, like SQL injection and buffer overflows, are fixable because while the user data and the system code may interact, they never (failing a bug) interact in a way that breaks the boundary between those two sides.
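For the SQL side of the comparison, this is what the boundary looks like in practice, sketched with Python's standard `sqlite3` module. The table and the hostile input are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice'; DROP TABLE users; --"

# The query text is fixed; user_input is bound as a value and never parsed as SQL.
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

# The table still exists and still holds its single row.
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

if __name__ == "__main__":
    print("matching rows:", rows)
    print("rows left in table:", remaining)
```

The hostile string is simply compared against the `name` column as a literal, so it matches nothing and changes nothing.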

▲ Lerc 2 hours ago | parent | next [-]

Ok in the SQL example imagine if you had a SQL engine that issued commands encoded in ASCII in the high byte of 16 bit characters, and all non-command data as ASCII in the low byte of 16 bit characters.

If user input can only be in the low byte, it cannot influence the command structure.

A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.

>You cannot separate data that was input by the user and data that is from the system once it is mixed together like that.

You can train a model to not mix things, many models are trained to separate things. A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs. Sure it could be trained to reverse the output, but it is also easy to train something to the point that you have a high confidence to never do that.
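The high-byte/low-byte scheme from the start of this comment can be sketched in Python as follows. All function names are hypothetical, and the sketch only demonstrates the encoding-level separation; it says nothing about whether a provenance embedding gives a comparable guarantee inside a trained network.

```python
def _ascii_code(ch):
    """Return the code of a non-NUL ASCII character, or raise ValueError."""
    code = ord(ch)
    if not 1 <= code <= 0x7F:
        raise ValueError(f"not a usable ASCII character: {ch!r}")
    return code


def encode_command(text):
    """System-side text: each ASCII code sits in the HIGH byte of a 16-bit unit."""
    return [_ascii_code(ch) << 8 for ch in text]


def encode_user_data(text):
    """User-side text: each ASCII code sits in the LOW byte; the high byte stays 0."""
    return [_ascii_code(ch) for ch in text]


def split_channels(units):
    """Recover (command, data) from a mixed stream purely by which byte is set."""
    command = "".join(chr(u >> 8) for u in units if u > 0xFF)
    data = "".join(chr(u) for u in units if u <= 0xFF)
    return command, data


stream = encode_command("SELECT name WHERE id=") + encode_user_data("1; DROP TABLE users")
command, data = split_channels(stream)

if __name__ == "__main__":
    print("command:", command)
    print("data   :", data)
```

Because `encode_user_data` can only ever produce units below `0x100`, nothing a user types can land in the command channel, whatever the text says.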

▲ solid_fuel an hour ago | parent [-]

> Ok in the SQL example imagine if you had a SQL engine that issued commands encoded in ASCII in the high byte of 16 bit characters, and all non-command data as ASCII in the low byte of 16 bit characters.

> If user input can only be in the low byte, it cannot influence the command structure.

> A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.

A similar thing cannot be done with embeddings. You are lacking a fundamental understanding of the issue. The only reason that you can separate user and command data in SQL queries is because the command data is used to command a deterministic machine which then uses the user data as inputs to carefully constructed operations like comparisons. This is not how LLMs operate. There is no deterministic machinery executing a system prompt against user data, there is only a single array of tensors which get fed into a giant block of linear algebra and multiplied together.

> You can train a model to not mix things, many models are trained to separate things.

That is not applicable to this, because segmentation models are not the same thing as LLMs. They have different architectures.

> A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs.

Not even close to the same thing, to the point where this is irrelevant. Feel free to prove me wrong, github links welcome below.

▲ anuramat 41 minutes ago | parent | prev | next [-]

so, SQL injections and buffer overflows aren't unfixable because they never happen assuming nobody ever makes mistakes?

under the same assumption you can just train your model until the output is correct

▲ lostmsu 2 hours ago | parent | prev [-]

This argument makes no sense. Data coming to your network adapter is also "stitched together and fed".

▲ solid_fuel an hour ago | parent [-]

> This argument makes no sense. Data coming to your network adapter is also "stitched together and fed".

Try reading it from start to end, it will make more sense if you think about it.

By the way, if your OS is taking untrusted data from the network, inserting it into an executable code page, and loading it into the CPU then you have some SERIOUS security issues.

▲ anuramat 24 minutes ago | parent [-]

but it's all just bytes?

▲ solid_fuel 13 minutes ago | parent [-]

It's all bytes but untrusted user data is stored in memory pages which are not marked executable. The CPU physically will not run instructions which are in areas of memory which are not marked as executable. This is a foundational principle of computing security.

> In computer security, executable-space protection marks memory regions as non-executable, such that an attempt to execute machine code in these regions will cause an exception. It relies on hardware features such as the NX bit (no-execute bit), or on software emulation when hardware support is unavailable. Software emulation often introduces a performance cost, or overhead (extra processing time or resources), while hardware-based NX bit implementations have no measurable performance impact.

https://en.wikipedia.org/wiki/Executable-space_protection

▲ denkmoon 3 hours ago | parent | prev | next [-]

hopes and dreams are one hell of a drug

▲ infecto 3 hours ago | parent | prev [-]

I don’t get it either. I think there is a reasonable expectation to try to catch these things but at the end of the day it’s figuring out some form of probabilistic outcome.

▲ solid_fuel 2 hours ago | parent [-]

What really surprises me about this is that it sounds like they're not even trying to classify and censor generated images post-generation?

Nothing is perfect, but there are tiny classifier models that can at least mark things containing nudity and gore. That would be the bare-minimum I would expect for trying to put guardrails around an image generator.
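A rough Python sketch of what such a post-generation gate could look like. The classifier here is a stub that fakes its scores; `Verdict`, `gate`, `stub_classifier`, the labels and the 0.5 threshold are all assumptions for illustration, not any vendor's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Verdict:
    allowed: bool
    reason: str


def gate(image: bytes,
         classify: Callable[[bytes], Dict[str, float]],
         threshold: float = 0.5) -> Verdict:
    """Run a classifier over a generated image and block it if any label scores high."""
    scores = classify(image)
    flagged = {label: s for label, s in scores.items() if s >= threshold}
    if flagged:
        worst = max(flagged, key=flagged.get)
        return Verdict(False, f"blocked: {worst}")
    return Verdict(True, "ok")


def stub_classifier(image: bytes) -> Dict[str, float]:
    """Placeholder for a real model: fakes a score from the payload bytes."""
    return {"nudity": 0.9 if b"nsfw" in image else 0.01, "gore": 0.02}


if __name__ == "__main__":
    print(gate(b"cat picture bytes", stub_classifier))
    print(gate(b"nsfw payload bytes", stub_classifier))
```

The gate runs on the output alone, so it does not care how the prompt was phrased. Its accuracy is only as good as the classifier plugged in, which is the hard part.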

▲ transcriptase 2 hours ago | parent [-]

and yet as fable demonstrated in its inability to differentiate anything physics biology or chemistry related from actual safety concerns, it’s apparently not easy to do