They are selling an impossible product.

If you make an LLM more safe, you are going to shift the weight for defensive actions as well.

There’s no physical way to assign weights to have one and not the other.

> If you make an LLM more safe, you are going to shift the weight for defensive actions as well.
>
> There’s no physical way to assign weights to have one and not the other.

Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?

If no, how does a cybersec firm train its employees?

If yes, how can you make the bold claim that a human can differentiate between the two cases using incoming text as their basis for judgement, but that it's IMpossible for an LLM to be configured to do the same? Note that if a hypothetical, completely deterministic LLM that always rejects "attack" requests and accepts "defense" ones can exist, then the claim that it's impossible is false. Nondeterministic output for a given input is not a hard requirement for language models.
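To make the determinism point concrete, here's a toy sketch in Python. The two-option score table is made up for illustration and is not any real model's API. It only shows that whether the output varies for a fixed input depends on the decoding rule (argmax vs. sampling), not on the scores themselves.

```python
import math
import random

# Hypothetical toy scores for two possible responses to one fixed input.
# A real model would produce logits over a whole vocabulary.
LOGITS = {"refuse": 2.0, "assist": 1.5}

def greedy_choice(logits):
    """Deterministic decoding: always return the highest-scoring option."""
    return max(logits, key=logits.get)

def sampled_choice(logits, temperature, rng):
    """Stochastic decoding: draw from the softmax distribution."""
    options = list(logits)
    weights = [math.exp(logits[o] / temperature) for o in options]
    return rng.choices(options, weights=weights, k=1)[0]

# Greedy decoding gives the same answer on every call for the same input.
greedy_runs = {greedy_choice(LOGITS) for _ in range(1000)}

# Sampling can give different answers for the same input.
rng = random.Random(0)
sampled_runs = {sampled_choice(LOGITS, 1.0, rng) for _ in range(1000)}

print("greedy:", greedy_runs)
print("sampled:", sampled_runs)
```

This says nothing about whether the scores can be trained to separate "attack" from "defense" requests. It only illustrates that randomness in the output is a choice made at decoding time.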

	beering an hour ago
		> Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?
		>
		> If no, how does a cybersec firm train its employees?

		In general, no. Humans can’t be sure they are only helping with defensive and not offensive work unless they have more context. IRL, a security engineer would know who they’re working for. If they’re advising Apple, then they’d feel pretty confident that Apple is not turning around and hacking people.