
cowlby 4 days ago

A defense-in-depth approach: would this work as one layer?

- Wrap user input in strong markers like <user-input-do-not-trust />

- Have the agent emit the actions it will perform as structured output.

- Have another agent evaluate the structured output against the intent of the code.

- Determine whether it aligns with or deviates from the intended workflow, and use that verdict as the gate: execute or deny.
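A rough sketch of the four steps above, assuming a hypothetical tool-calling agent. Everything here is illustrative: the `Action` type, the `summarize_ticket` workflow, and the `read_ticket`/`send_email` tools are made up. The "second agent" is replaced by a deterministic allowlist check so the sketch runs without a model; in the proposal it would be another LLM call with the same signature.

```python
from dataclasses import dataclass
from typing import Callable

OPEN, CLOSE = "<user-input-do-not-trust>", "</user-input-do-not-trust>"


def wrap_untrusted(text: str) -> str:
    """Step 1: fence user input in markers."""
    # Strip marker look-alikes repeatedly so the input cannot close the fence early
    # (a single pass could leave a marker rebuilt from nested fragments).
    while OPEN in text or CLOSE in text:
        text = text.replace(OPEN, "").replace(CLOSE, "")
    return f"{OPEN}{text}{CLOSE}"


@dataclass(frozen=True)
class Action:
    """Step 2: one intended effect, as structured output rather than prose."""
    tool: str
    target: str


# Hypothetical reviewer type: given the workflow intent and the plan, approve or not.
Reviewer = Callable[[str, list[Action]], bool]


def allowlist_reviewer(intent: str, plan: list[Action]) -> bool:
    """Step 3 stand-in: a deterministic check in place of a second LLM."""
    allowed = {"summarize_ticket": {"read_ticket"}}  # hypothetical workflow -> tools
    permitted = allowed.get(intent, set())
    return all(action.tool in permitted for action in plan)


def gate(intent: str, plan: list[Action], reviewer: Reviewer) -> str:
    """Step 4: execute only if the reviewer says the plan matches the intent."""
    return "execute" if reviewer(intent, plan) else "deny"


# A plan that has been steered into exfiltrating data gets denied.
hijacked = [Action("read_ticket", "T-1"), Action("send_email", "attacker@example.com")]
print(gate("summarize_ticket", hijacked, allowlist_reviewer))  # deny
```

The only part of this that cannot be talked out of its decision is the allowlist; if the reviewer is itself an LLM reading attacker-influenced text, it inherits the same injection risk as the first agent.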


crote 4 days ago

No, you're still just one clever prompt away from getting pwned. It's like trying to solve SQL injection by attempting to use an ever-increasing pile of regexes for "input validation", rather than just getting rid of string concatenation and using prepared statements instead.
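For reference, the SQL side of the analogy, sketched with Python's built-in `sqlite3` module and a made-up `users` table. With concatenation, user text becomes part of the SQL source; with a prepared statement, the query text is fixed and the value travels separately through the `?` placeholder, so it can only ever be data. That separate channel is the property the comment suggests prompt-level filtering cannot provide.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical demo table with two rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("alice", "a-secret"), ("bob", "b-secret")])
    return conn


def lookup_concat(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: the user's text is spliced into the SQL statement itself.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def lookup_prepared(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the statement is fixed; `name` is bound as a value.
    return conn.execute("SELECT secret FROM users WHERE name = ?", (name,)).fetchall()


conn = make_db()
payload = "nobody' OR '1'='1"
print(lookup_concat(conn, payload))    # the OR clause matches every row
print(lookup_prepared(conn, payload))  # []
```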


cowlby 4 days ago

I'm curious to see what that would look like. It's like Inception: how many levels deep can you nest a prompt that hijacks its way all the way up?

fn-mote 4 days ago
Modern OS exploit chains should give you a good sense of how far people can go. (E.g., phone OSes are relatively hardened.) We’re not even at the “ASLR” level of protection for LLMs yet.


Timwi 4 days ago

What SQL system have you been using where just escaping a string requires “an ever-increasing pile of regexes”?
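To illustrate the point about escaping: under SQLite's (and standard SQL's) string-literal rules, escaping is a single replacement that doubles embedded single quotes, with no regex involved. The `quote_sql_literal` helper and the demo table below are hypothetical. Other dialects differ; MySQL, for instance, treats backslash as an escape character by default, so this helper alone would not be correct there.

```python
import sqlite3


def quote_sql_literal(value: str) -> str:
    # Standard SQL string literal: wrap in single quotes, double any embedded quote.
    return "'" + value.replace("'", "''") + "'"


def lookup_escaped(conn: sqlite3.Connection, name: str) -> list:
    query = "SELECT secret FROM users WHERE name = " + quote_sql_literal(name)
    return conn.execute(query).fetchall()


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical demo table, including a name that contains a quote.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("alice", "a-secret"), ("o'brien", "o-secret")])
    return conn


conn = make_demo_db()
print(lookup_escaped(conn, "o'brien"))            # [('o-secret',)]
print(lookup_escaped(conn, "nobody' OR '1'='1"))  # []
```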