The LLM is basically an iterative function going guess_next_text(entire_document). There is no algorithm-level distinction at all between "system prompt" or "user prompt" or user input... or even between its own prior output. Everything is concatenated into one big equally-untrustworthy stream.

I suspect a lot of techies operate with a subconscious good-faith assumption: "That can't be how X works, nobody would ever built it that way, that would be insecure and naive and error-prone, surely those bajillions of dollars went into a much better architecture."

Alas, when it comes to day's the AI craze, the answer is typically: "Nope, the situation really is that dumb."

__________

P.S.: I would also like to emphasize that even if we somehow color-coded or delineated all text based on origin, that's nowhere close to securing the system. An attacker doesn't need to type $EVIL themselves, they just need to trick the generator into mentioning $EVIL.

▲

alexbecker 5 days ago | parent [-]

There have been attempts like https://arxiv.org/pdf/2410.09102 to do this kind of color-coding but none of them work in a multi-turn context since as you note you can't trust the previous turn's output

	▲	Terr_ 5 days ago \| parent [-]
		Yeah, the functionality+security everyone is dreaming about requires much more than "where did the the words come from." As we keep following the thread of "one more required improvement", I think it'll lead to: "Crap, we need to invent a real AI just to keep the LLM in line." Even just the first step on the list is a doozy: The LLM has no authorial ego to separate itself from the human user, everything is just The Document. Any entities we perceive are human cognitive illusions, the same way that the "people" we "see" inside a dice-rolled mad-libs story don't really exist. That's not even beginning to get into things like "I am not You" or "I have goals, You have goals" or "goals can conflict" or "I'm just quoting what You said, saying these words doesn't mean I believe them", etc.