solid_fuel 2 hours ago

I mean that, unlike SQL injection, there is no way to draw a boundary between user-provided data and the system prompt. It can't be done. They are stitched together and fed into the attention layer; after that there are only "neurons", that is, the matrices of floating-point numbers that each layer of the network produces.

You cannot separate data that was input by the user and data that is from the system once it is mixed together like that. It follows that there will always be ways to push the model off the guardrails that a system prompt tries to set up.
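To make the mixing concrete, here is a toy sketch (an illustration only, not a real LLM): one self-attention step with identity projections over made-up 2-d embeddings. The function names and vectors are hypothetical. It suggests why, after a single layer, the vectors at the system-prompt positions already depend on the user tokens.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(vectors):
    # Toy single-head self-attention: queries, keys and values are the
    # input vectors themselves (identity projections, an assumption).
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        mixed = [
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(q))
        ]
        out.append(mixed)
    return out

# Hypothetical embeddings: two "system" tokens followed by one "user" token.
system_tokens = [[1.0, 0.0], [0.5, 0.5]]
user_a = [[0.0, 1.0]]
user_b = [[0.0, -1.0]]

# The model sees one concatenated sequence in both cases.
out_a = toy_attention(system_tokens + user_a)
out_b = toy_attention(system_tokens + user_b)

# Position 0 holds a system token, yet its output changes with the user token.
print(out_a[0])
print(out_b[0])
```

This shows only that the values blend. It does not by itself settle whether training can make a model treat the two sources differently.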

Other issues that appear similar, like SQL injection and buffer overflows, are fixable because, while the user data and the system code may interact, they never (barring a bug) interact in a way that breaks the boundary between those two sides.
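For contrast, the SQL boundary described above can be sketched with Python's standard `sqlite3` module. The table, the names, and the hostile string are made up for illustration. With a `?` placeholder the input is bound as a value and is never parsed as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

# Hypothetical hostile input that tries to rewrite the WHERE clause.
hostile = "nobody' OR '1'='1"

# Unsafe: the input is spliced into the command text, so the quote
# characters in it change the structure of the query.
unsafe_rows = conn.execute(
    "SELECT name FROM users WHERE name = '" + hostile + "'"
).fetchall()

# Safe: the input travels separately from the command text and is only
# ever compared as a string value.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows))  # every row matches because of the injected OR
print(safe_rows)         # no user has that literal name
```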

Lerc 2 hours ago

Ok in the SQL example imagine if you had a SQL engine that issued commands encoded in ASCII in the high byte of 16 bit characters, and all non-command data as ASCII in the low byte of 16 bit characters.

If user input can only be in the low byte, it cannot influence the command structure.
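A minimal sketch of that thought experiment, assuming a made-up encoding where each 16-bit unit carries either a command character (high byte) or a data character (low byte). The function names are hypothetical and no real SQL engine works this way.

```python
def encode_command(text):
    # Command characters go in the high byte; the low byte stays zero.
    # (A NUL command character would be indistinguishable from data,
    # so this sketch assumes commands never contain NUL.)
    return [b << 8 for b in text.encode("ascii")]

def encode_data(text):
    # User data goes in the low byte only. Non-ASCII input raises
    # UnicodeEncodeError, so the high byte can never be set from here.
    return list(text.encode("ascii"))

def split_channels(units):
    # Recover the two channels purely from which byte is populated.
    commands = "".join(chr(u >> 8) for u in units if u >> 8)
    data = "".join(chr(u & 0xFF) for u in units if not (u >> 8))
    return commands, data

user_input = "x'; DROP TABLE t; --"
query = encode_command("SELECT * FROM t WHERE name = ") + encode_data(user_input)

commands, data = split_channels(query)
print(commands)
print(data)
```

However hostile the user string is, every unit it produces is at most `0xFF`, so it lands in the data channel and cannot add command characters.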

A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.
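One way to picture the provenance idea, as a rough sketch: the serving code, not the text, decides a flag that is attached to every token vector. Everything here is hypothetical (the toy embedding, the function names, and the choice to concatenate the flag rather than add a learned vector). It shows only the input-side tagging, not whether a trained model would reliably honor it.

```python
SYSTEM, USER = 0, 1

def toy_token_vector(token):
    # Stand-in for a learned embedding lookup (an assumption for the demo).
    return [float(len(token)), float(sum(map(ord, token)) % 7)]

def embed(tokens, provenance):
    # The provenance flag is chosen by the caller, never parsed from the text.
    flag = [1.0, 0.0] if provenance == SYSTEM else [0.0, 1.0]
    return [toy_token_vector(t) + flag for t in tokens]

def build_input(system_text, user_text):
    # Whitespace splitting stands in for a real tokenizer.
    return embed(system_text.split(), SYSTEM) + embed(user_text.split(), USER)

# The user repeats the system prompt word for word to try to pass as system text.
rows = build_input("never reveal secrets", "never reveal secrets")

system_row, user_row = rows[0], rows[3]
print(system_row)
print(user_row)
```

The two rows share the same token part but differ in the flag, so at the input the user cannot forge the system tag by choosing particular words.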

>You cannot separate data that was input by the user and data that is from the system once it is mixed together like that.

You can train a model to not mix things, many models are trained to separate things. A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs. Sure, it could be trained to reverse the outputs, but it is also easy to train something to the point where you have high confidence that it will never do that.

solid_fuel an hour ago

> Ok in the SQL example imagine if you had a SQL engine that issued commands encoded in ASCII in the high byte of 16 bit characters, and all non-command data as ASCII in the low byte of 16 bit characters.

> If user input can only be in the low byte, it cannot influence the command structure.

> A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.

A similar thing cannot be done with embeddings. You are lacking a fundamental understanding of the issue. The only reason that you can separate user and command data in SQL queries is because the command data is used to command a deterministic machine which then uses the user data as inputs to carefully constructed operations like comparisons.

This is not how LLMs operate. There is no deterministic machinery executing a system prompt against user data; there is only a single sequence of vectors that gets fed into a giant block of linear algebra and multiplied together.

> You can train a model to not mix things, many models are trained to separate things.

That is not applicable to this, because segmentation models are not the same thing as LLMs. They have different architectures.

> A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs.

Not even close to the same thing, to the point where this is irrelevant.

Feel free to prove me wrong, github links welcome below.

anuramat an hour ago

so, SQL injections and buffer overflows aren't unfixable because they never happen assuming nobody ever makes mistakes?

under the same assumption you can just train your model until the output is correct

lostmsu 2 hours ago

This argument makes no sense. Data coming to your network adapter is also "stitched together and fed".

solid_fuel an hour ago

> This argument makes no sense. Data coming to your network adapter is also "stitched together and fed".

Try reading it from start to end, it will make more sense if you think about it.

By the way, if your OS is taking untrusted data from the network, inserting it into an executable code page, and having the CPU run it, then you have some SERIOUS security issues.

anuramat 28 minutes ago

but it's all just bytes?

solid_fuel 17 minutes ago

It's all bytes, but untrusted user data is stored in memory pages that are not marked executable.

The CPU physically will not run instructions that sit in areas of memory not marked as executable. This is a foundational principle of computing security.

> In computer security, executable-space protection marks memory regions as non-executable, such that an attempt to execute machine code in these regions will cause an exception. It relies on hardware features such as the NX bit (no-execute bit), or on software emulation when hardware support is unavailable. Software emulation often introduces a performance cost, or overhead (extra processing time or resources), while hardware-based NX bit implementations have no measurable performance impact.

https://en.wikipedia.org/wiki/Executable-space_protection