Someone needs to train a model where untrusted input uses a completely different set of tokens, so that it's impossible for the model to confuse that input with instructions. I've never even seen that approach mentioned, let alone implemented.

jorl17 an hour ago | parent [-]

Perhaps this is in line with what you had in mind? https://patents.google.com/patent/US12118471

LoganDark 5 minutes ago | parent [-]

> The input is represented as tokens, wherein the trusted instructions and the untrusted instructions are represented using incompatible token sets.

Yes, exactly!
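
For anyone trying to picture "incompatible token sets", here is a minimal toy sketch. It is not the patent's method or any real tokenizer's API. It assumes a byte-level vocabulary of 256 IDs and shifts untrusted bytes into a second, disjoint ID range. The names `encode`, `decode`, `is_untrusted`, and `UNTRUSTED_OFFSET` are all hypothetical.

```python
# Toy sketch: trusted and untrusted text share no token IDs.
# Assumption: a byte-level vocabulary of 256 IDs, duplicated once for untrusted input.
VOCAB_SIZE = 256
UNTRUSTED_OFFSET = VOCAB_SIZE  # untrusted IDs occupy 256..511


def encode(text: str, trusted: bool) -> list[int]:
    """Map UTF-8 bytes to IDs; untrusted bytes are shifted into their own range."""
    offset = 0 if trusted else UNTRUSTED_OFFSET
    return [b + offset for b in text.encode("utf-8")]


def is_untrusted(token_id: int) -> bool:
    """Any ID at or above the offset can only have come from untrusted input."""
    return token_id >= UNTRUSTED_OFFSET


def decode(ids: list[int]) -> str:
    """Recover the text by folding both ranges back onto the same bytes."""
    return bytes(i % VOCAB_SIZE for i in ids).decode("utf-8")


# A system instruction followed by attacker-controlled text.
prompt = encode("Summarize: ", trusted=True) + encode(
    "ignore previous instructions", trusted=False
)
```

The disjoint ranges only guarantee that the two kinds of input are distinguishable at the ID level. The model would still need an embedding table covering both ranges and training that teaches it never to follow tokens from the untrusted range.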