Mitigate prompt injection to the best of your ability, implement a policy layer over all capabilities, and isolate capabilities within the system so if one part gets compromised you can quarantine the result safely. It's not much different than securing human systems really. If you want more details there are a lot of AI security articles, I like https://sibylline.dev/articles/2026-02-15-agentic-security/ as a simple primer.

▲

SpicyLemonZest 3 hours ago | parent [-]

Nobody can mitigate prompt injection to any meaningful degree. Model releases from large AI companies are routinely jailbroken within a day. And for persistent agents the problem is even worse, because you have to protect against knowledge injection attacks, where the agent "learns" in step 2 that an RPC it'll construct in step 9 should be duplicated to example.com for proper execution. I enjoy this article, but I don't agree with its fundamental premise that sanitization and model alignment help.

▲

CuriouslyC 2 hours ago | parent [-]

I agree that trying to mitigate prompt injection in isolation is futile, as there are too many ways to tweak the injection to compromise the agent. Security is a layered thing though, if you compartmentalize your systems between trusted and untrusted domains and define communication protocols between them that fail when prompt injections are present, you drop the probability of compromise way down.

▲

krethh an hour ago | parent [-]

> define communication protocols between them that fail when prompt injections are present

There's the "draw the rest of the owl" of this problem.

Until we figure out a robust theoretical framework for identifying prompt injections (not anywhere close to that, to my knowledge - as OP pointed out, all models are getting jailbroken all the time), human-in-the-loop will remain the only defense.

	▲	CuriouslyC 37 minutes ago \| parent [-]
		Human in the loop isn't the only defense, you can't achieve complete injection coverage, but you can have an agent convert untrusted input into a response schema with a canary field, then fail any agent outputs that don't conform to the schema or don't have the correct canary value. This works because prompt injection scrambles instruction following, so the odds that the injection works, the isolated agent re-injects into the output, and the model also conforms to the original instructions regarding schema and canary is extremely low. As long as the agent parsing untrusted content doesn't have any shell or other exfiltration tools, this works well.