| ▲ | liuliu 4 hours ago | ||||||||||||||||
The solution is to make the model stronger so the malicious intents can be better distinguished (and no, it is not a guarantee, like many things in life). Sandbox is a basic, but as long as you give the model your credential, there isn't much guardrails can be done other than making the model stronger (separate guard model is the wrong path IMHO). | |||||||||||||||||
| ▲ | ramoz 3 hours ago | parent [-] | ||||||||||||||||
I think generally correct to say "hey we need stronger models" but rather ambitious to think we really solve alignment with current attention-based models and RL side-effects. Guard model gives an additional layer of protection and probably stronger posture when used as an early warning system. | |||||||||||||||||
| |||||||||||||||||