Well, the challenge is to know if the action supposed to be executed BEFORE it is requested to be executed. If the email with my secrets is sent, it is too late to deal with the consequences.

Sandboxes could provide that level of observability, HOWEVER, it is a hard lift. Yet, I don't have better ideas either. Do you?

▲

liuliu 4 hours ago | parent | next [-]

The solution is to make the model stronger so the malicious intents can be better distinguished (and no, it is not a guarantee, like many things in life). Sandbox is a basic, but as long as you give the model your credential, there isn't much guardrails can be done other than making the model stronger (separate guard model is the wrong path IMHO).

▲

ramoz 3 hours ago | parent [-]

I think generally correct to say "hey we need stronger models" but rather ambitious to think we really solve alignment with current attention-based models and RL side-effects. Guard model gives an additional layer of protection and probably stronger posture when used as an early warning system.

▲

liuliu 2 hours ago | parent [-]

Sure. If you treat "guard model" as diversification strategy, it is another layer of protection, just like diversification in compilation helps solving the root of trust issue (Reflections on Trusting Trust). I am just generally suspicious about the weak-to-strong supervision.

I think it is in general pretty futile to implement permission systems / guardrails which basically insert a human in the loop (humans need to review the work to fully understand why it needs to send that email, and at that point, why do you need a LLM to send the email again?).

	▲	ramoz 2 hours ago \| parent [-]
		fair enough

▲

ramoz 4 hours ago | parent | prev [-]

if you extend the definition of sandbox, then yea.

Solutions no, for now continued cat/mouse with things like "good agents" in the mix (i.e. ai as a judge - of course just as exploitable through prompt injection), and deterministic policy where you can (e.g. OPA/rego).

We should continue to enable better integrations with runtime - why i created the original feature request for hooks in claude code. Things like IFC or agent-as-a-judge can form some early useful solutions.