| ▲ | bcrosby95 2 hours ago | |||||||
This is where the old line of "LLMs are just next token predictors" actually factors in. I don't know how you get a next token predictor that user input can't break out of. The answer is for the implementer to try to split what they can, and run pre/post validation. But I highly doubt it will ever be 100%, its fundamental to the technology. | ||||||||
| ▲ | miki123211 41 minutes ago | parent | next [-] | |||||||
I think this is fundamental to any technology, including human brains. Humans have a problem distinguishing "John from Microsoft" from somebody just claiming to be John from Microsoft. The reason why scamming humans is (relatively) hard is that each human is different. Discovering the perfect tactic to scam one human doesn't necessarily scale across all humans. LLMs are the opposite; my Chat GPT is (almost) the same as your Chat GPT. It's the same model with the same system message, it's just the contexts that differ. This makes LLM jailbreaks a lot more scalable, and hence a lot more worthwhile to discover. LLMs are also a lot more static. With people, we have the phenomenon of "banner blindness", which LLMs don't really experience. | ||||||||
| ||||||||
| ▲ | salt4034 40 minutes ago | parent | prev [-] | |||||||
It's hard in general, but for instruct/chat models in particular, which already assume a turn-based approach, could they not use a special token that switches control from LLM output to user input? The LLM architecture could be made so it's literally impossible for the model to even produce this token. In the example above, the LLM could then recognize this is not a legitimate user input, as it lacks the token. I'm probably overlooking something obvious. | ||||||||
| ||||||||