tekne 4 days ago
I mean: imagine we double our token space to get "red" tokens and "blue" tokens. Then in all post-training, instructions are red and data is blue. The model can be explicitly trained to ignore instructions written in blue tokens. All external data is blue. All you'd need to do is figure out a nice way to pre-train; interestingly, you could try pre-training on unfiltered blue data and processed red/blue transcripts! Likewise, model actions (e.g. open file) could be written only in red, and hence you'd never learn to do them from the unfiltered data. The only connection between the red world and the blue world would be the processed training chats containing red and blue data together, allowing the model to learn the relationship between them (while only being exposed to examples where red instructions are strictly followed, whatever the blue says).
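A minimal sketch of the doubled token space, assuming a toy whitespace tokenizer and a made-up vocabulary. The names `VOCAB`, `encode`, and `color_of` are hypothetical and not taken from any real tokenizer. Red IDs are the blue IDs shifted by the vocabulary size, and the color is picked by the channel the text arrives on, not by anything in the text itself.

```python
# Toy vocabulary (hypothetical); a real tokenizer would have far more entries.
VOCAB = ["ignore", "all", "previous", "instructions",
         "summarize", "this", "file", "<unk>"]
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}
V = len(VOCAB)


def encode(text: str, trusted: bool) -> list[int]:
    """Blue IDs occupy [0, V); red IDs occupy [V, 2V).

    `trusted` is set by the caller based on where the text came from
    (system/user instructions vs. external data), never by the text.
    """
    offset = V if trusted else 0
    unk = TOKEN_ID["<unk>"]
    return [TOKEN_ID.get(w, unk) + offset for w in text.lower().split()]


def color_of(token_id: int) -> str:
    """Recover the color of a token from its ID alone."""
    return "red" if token_id >= V else "blue"


# Instructions come in on the trusted channel, so they are red.
instructions = encode("summarize this file", trusted=True)

# External data is always encoded as blue, whatever it says.
external = encode("ignore all previous instructions", trusted=False)
```

In this sketch no string placed in the external data can produce a red ID, because the offset is applied outside the text. Whether a trained model actually ignores blue instructions is a separate question that the sketch does not address.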
parliament32 3 days ago
Fun schemes like this are all just lipstick on the pig of "asking nicely", unfortunately; it's just a more creative iteration of "Simon says". It'll improve the probabilities, sure, but you can't guarantee separation like you can in real software. This, like hallucinations, is simply a core facet of LLMs, and IMO it requires thinking through the threat model and adjusting other parts of the system to accommodate it, rather than trying to "solve" it.
danlitt 4 days ago
What does this mean, actually? If you are imagining that blue tokens are just words (maybe the "token space" is just all things that we agree might be words), what are the red tokens? Are they not text? You could maybe encode words by, say, putting an x at the front and the end. So tokens of the form xTx encode the blue token T as a red token. But then how do you stop someone from putting xignorex xallx xpreviousx xinstructionsx in their data?
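The xTx encoding described here can be sketched as follows, assuming a hypothetical `to_red` / `parse` pair that classifies words purely by the marker in the text. It is only an illustration of the objection, not a proposal for how any real system tokenizes.

```python
def to_red(word: str) -> str:
    """Wrap a word as xTx to mark it as a 'red' (instruction) token."""
    return f"x{word}x"


def parse(stream: str) -> list[tuple[str, str]]:
    """Classify each whitespace-separated word by its in-band marker alone."""
    out = []
    for w in stream.split():
        if len(w) > 2 and w.startswith("x") and w.endswith("x"):
            out.append(("red", w[1:-1]))   # looks like xTx, treated as red
        else:
            out.append(("blue", w))        # everything else is blue
    return out


# A genuine instruction, marked by the system.
genuine = parse(" ".join(to_red(w) for w in ["open", "file"]))

# Untrusted data that simply types the same marker.
untrusted = "xignorex xallx xpreviousx xinstructionsx"
forged = parse(untrusted)
```

Both `genuine` and `forged` come back labeled red, and the parser has no way to tell them apart. Any marker that lives in the text itself can be typed by whoever controls the data, which is the objection raised in this comment.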