▲ | ctoth 7 days ago
Can someone explain to me how "preventative steering" isn't an implementation of the most-forbidden technique? This sounds a lot like interpretability-guided training optimization, which I thought was a big, big, big no-no. It will still introduce optimization pressure, no? My understanding is that you shouldn't use insights gained from interpretability to feed back into your training process, at the risk of losing the interpretability in the first place.
▲ | ec109685 7 days ago
Read 5.2. They don't add a new loss over the probe signal. Instead they take a fixed persona vector v (found beforehand) and add +αv to the residual stream on each forward pass while fine-tuning. The idea is to cancel the gradient push toward that trait, not to hunt for a lower "trait score" during training. Because v is frozen, the optimizer still minimizes the ordinary task loss; there's no feedback loop that could re-encode the trait in some opaque basis.

Empirically, Fig. 7B shows this keeps evil/sycophancy/hallucination near baseline while MMLU stays ~flat. Caveats the authors themselves note: single-layer steering doesn't always wipe the trait, so they try all-layer steering in App. J.3, which works better without hurting accuracy. They also tried a true regularization loss on the projection and found it did hide the signal elsewhere, i.e. exactly the failure mode you're worried about.

So it's closer to "bias injection" than to "optimize on the probe," which is why they argue it avoids the classic interpretability-collapse problem.
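The key property (a frozen direction added to activations during ordinary fine-tuning, with no loss term over the probe) can be sketched with a PyTorch forward hook. This is a toy illustration, not the paper's code: a single linear layer stands in for a transformer block, and v is just a random unit vector standing in for an extracted persona vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
layer = nn.Linear(d_model, d_model)   # stand-in for one transformer block

# Frozen, unit-norm "persona vector" (random here; extracted beforehand
# in the real pipeline). It carries no gradient and is never optimized.
v = torch.randn(d_model)
v = v / v.norm()
alpha = 4.0                           # steering coefficient

# Forward hook adding +alpha*v to the block's output (the "residual
# stream" in a real model). Returning a value replaces the output.
def steer(module, inputs, output):
    return output + alpha * v

handle = layer.register_forward_hook(steer)

# One ordinary fine-tuning step: the task loss is computed on steered
# activations, but there is no loss term over v or any probe score.
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x = torch.randn(8, d_model)
target = torch.zeros(8, d_model)
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
opt.step()

handle.remove()   # at inference time the steering is simply removed
```

Because `steer` is a fixed additive bias rather than a training objective, gradient descent has nothing to "game": it cannot move the trait into another basis to lower a probe score, since no probe score appears in the loss.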
| ||||||||
▲ | FergusArgyll 7 days ago
For ref | ||||||||
| ||||||||
▲ | vessenes 7 days ago
To be fair, the most-forbidden technique is a concept and a proposal, not an iron law. I don't work at Anthropic, but I imagine that internally their "helpful-only model" (the model that does not refuse, or the base model) has a list of things you don't do to it / with it. And I bet you're right that this technique is on that list. But because of the flexibility here (summary of technique: define a concept using words, determine a control vector related to the concept, use that control vector in a fine-tune step), you can optimize at the fine-tune stage for almost anything. I don't think they'll stop using a technique like this. But I think it's most likely to be deployed in a middle-of-the-cake manner, as one of the many proprietary steps the safety/fine-tuning folks go through when taking a foundation / helpful-only model to production. On those terms, I'm not sure this is that scary.
▲ | drewbeck 7 days ago
I'm new to this concept so may have missed something, but the post [0] seems to be about CoT specifically. In CoT you have an intermediary step that helps the model get better final results; the lesson is that if you try to improve the intermediary steps directly using training data, the model will optimize for better steps but not for better final results. I don't think this is the same situation: 1. Anthropic is steering activations directly to influence the final results, not training against good/bad probe readings, and 2. the target is the final result, not an intermediary. I can see a possible outcome where the model scores low on their sycophancy measure but still acts sycophantic; in that case a new vector may need to be calculated.

[0] https://thezvi.substack.com/p/the-most-forbidden-technique/
▲ | Turn_Trout 5 days ago
No one has empirically validated the so-called "most forbidden" descriptor. It's a theoretical worry which may or may not be correct. We should run experiments to find out.
▲ | bigmadshoe 7 days ago
You raise a good point. I wonder if they can re-compute persona vectors periodically during training. But at that point, why not just generate negative examples through system prompting with the negative traits?
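For what it's worth, as I understand the paper, the persona vectors themselves come from exactly that kind of contrastive prompting: v is extracted as the difference in mean activations between responses elicited with and without the trait. A toy sketch of that extraction step, with random tensors standing in for real hidden states collected at one layer:

```python
import torch

torch.manual_seed(0)
d_model = 16

# Stand-ins for hidden states gathered while the model answers under
# contrastive system prompts ("be sycophantic" vs. "be honest"). In the
# real pipeline these come from the model's residual stream; the +2.0
# offset is an artificial "trait direction" for illustration only.
acts_with_trait = torch.randn(100, d_model) + 2.0
acts_without_trait = torch.randn(100, d_model)

# Persona vector: difference of mean activations, normalized to unit
# length, then frozen for later use in steering.
v = acts_with_trait.mean(0) - acts_without_trait.mean(0)
v = v / v.norm()
```

Re-computing v periodically would just mean re-running this extraction on fresh contrastive rollouts of the partially fine-tuned model, though as the parent comment notes, whether that reintroduces a feedback loop is exactly the open question.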