Remix.run Logo
dijksterhuis 2 hours ago

normal

    y = f(x)
prompt injection / adversarial example (same thing really)

    bad_y = f(x+badness)
tweak badness enough you will get bad outputs. no matter the defences.

the only ways to fully “fix” it ie to make prompt injection never possible

1. don’t use ai

2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.

otherwise we’re left with mitigations. and mitigations are always a cat and mouse game with defenders (blue team) catching up. its never “fixed”. the latest thing just gets “patched”.

anuramat an hour ago | parent [-]

> tweak badness enough

assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

> the only way to fix ...

the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

solid_fuel 22 minutes ago | parent [-]

> assuming you get to do gradient descent AND the context is fixed+known AND you have unlimited compute? sure; is it a realistic setup?

Clearly nothing so complicated is required, given the prompt in the very article you are commenting on.

> the exact same argument applies to any (sufficiently complex) piece of software, with exactly the same conclusion

Yeah and the halting problem is hard too, but there's levels to this shit.

> also technically I'd argue that we do know the input/output space (set of all token strings of length <= N/token), and know the mapping (the model is a ~pure function in terms of the api, which is about as good of a representation as it gets for a non-invertible mapping); at least it's much closer than with something like linux

I would argue we don't even know the desired output for most inputs for an LLM and they certainly aren't trained on every possible input state. But I think Linux and LLMs are sufficient different that they aren't really directly comparable like this. After all, Linux is not a pure function and has lots of side effects.

But just to establish an order of magnitude: the input space for ChatGPT 3.0 was 2,048 tokens long. There were 50,257 tokens in the vocabulary. The input space thus has 50,257^(2048) unique states, which is approximately equal to 1.12 × 10^9628. That's an awful big input space for a single function.