ramoz 3 hours ago
I think it's generally correct to say "hey, we need stronger models," but rather ambitious to think we can really solve alignment with current attention-based models and the side effects of RL. A guard model gives an additional layer of protection, and probably a stronger posture when used as an early warning system.
liuliu 2 hours ago | parent
Sure. If you treat the guard model as a diversification strategy, it is another layer of protection, just as diversity in compilers helps address the root-of-trust issue (Reflections on Trusting Trust). I am just generally suspicious of weak-to-strong supervision. I think it is generally pretty futile to implement permission systems / guardrails that essentially insert a human in the loop: the human needs to review the work to fully understand why the agent should send that email, and at that point, why do you need an LLM to send the email in the first place?
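To make the pattern being debated concrete, here is a minimal sketch of a guard model screening an agent's tool calls before execution. All names (`guard_score`, `execute_with_guard`, the risk heuristics) are illustrative assumptions, not anyone's actual implementation; a real system would call a separately trained model rather than a hand-written heuristic.

```python
# Sketch: a weaker "guard" layer scores each tool call from a stronger
# agent model, allowing low-risk calls and escalating high-risk ones to
# review -- the human-in-the-loop step the parent comment questions.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


def guard_score(call: ToolCall) -> float:
    """Stand-in for a guard model: returns a risk score in [0, 1]."""
    base_risk = {"send_email": 0.6, "delete_file": 0.9}
    score = base_risk.get(call.tool, 0.1)
    # Toy heuristic: escalate emails to recipients outside the org.
    if call.tool == "send_email":
        recipient = str(call.args.get("to", ""))
        if not recipient.endswith("@example.com"):
            score = min(1.0, score + 0.3)
    return score


def execute_with_guard(call: ToolCall, threshold: float = 0.8):
    """Gate the call: below the threshold it proceeds; at or above,
    it is blocked and handed to a human reviewer (the early warning)."""
    score = guard_score(call)
    if score >= threshold:
        return ("blocked", score)
    return ("allowed", score)


print(execute_with_guard(ToolCall("send_email", {"to": "boss@example.com"})))
print(execute_with_guard(ToolCall("send_email", {"to": "x@attacker.net"})))
```

The design tension the thread raises shows up directly: the `blocked` branch only helps if the reviewer can judge the call faster than doing the task themselves.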