Remix.run Logo
vessenes 7 days ago

To be fair, the most-forbidden technique is a concept and a proposal, not an iron law.

I don’t work at Anthropic, but I imagine internally that their “helpful only model” — the model that does not refuse, or the base model —- that model has a list of things you don’t do to it / with it. And I bet you’re right this technique is on that list.

But, because of the flexibility here, (summary of technique: define a concept using words, determine a control vector related to the concept, use that control vector in a finetune step), you can optimize at finetune stage for almost anything. I don’t think they’ll stop using a technique like this. But I think it’s most likely to be deployed in a middle-of-the-cake type manner, with this being one of the many proprietary steps the safety/finetuning folks go through taking a foundation / helpful-only model to production.

On those terms, I’m not sure this is that scary.