creatonez 9 hours ago
> For example, it focuses a lot on doing "ablation studies", by which it means removing random layers of an already-trained model, to find the source of the refusals(?), which is an absolute fool's errand because such behavior is trained into the model as a whole and would not be found in any particular layer.

That doesn't mean there couldn't be a "concept neuron" that is doing the vast majority of heavy lifting for content refusal, though.
mapontosevenths 4 hours ago
That's not what it means at all. It uses SVD[0] to map the subspace in which the refusal happens. It's all pretty standard stuff with some hype on top to make it an interesting read. It's basically using a compression technique to figure out which logits are the relevant ones and then zeroing them.

[0] https://en.wikipedia.org/wiki/Singular_value_decomposition
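To make the "SVD, then zero it out" idea concrete, here is a rough sketch on synthetic data. It is not the project's actual code. It assumes the thing being decomposed is a matrix of hidden-activation differences between refused and complied prompts (rather than logits), and the names `refusal_subspace`, `ablate`, and the planted direction are all hypothetical, made up for illustration.

```python
import numpy as np

def refusal_subspace(refused, complied, k=1):
    """Top-k right singular vectors of the paired activation differences."""
    diffs = refused - complied                 # shape (n_prompts, d_model)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                              # orthonormal rows, shape (k, d_model)

def ablate(acts, basis):
    """Remove the component of each activation that lies in span(basis)."""
    return acts - (acts @ basis.T) @ basis

# Synthetic stand-in for real activations (assumption: no model is loaded here).
rng = np.random.default_rng(0)
d_model, n = 64, 200
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)             # hypothetical "refusal direction"
complied = rng.normal(size=(n, d_model))
refused = complied + 5.0 * planted + 0.1 * rng.normal(size=(n, d_model))

basis = refusal_subspace(refused, complied, k=1)
cleaned = ablate(refused, basis)

# How well SVD recovered the planted direction (sign is arbitrary, so use abs).
alignment = abs(float(basis[0] @ planted))
# Largest leftover component along the ablated direction.
residual = float(np.abs(cleaned @ basis.T).max())
```

In this toy setup the top singular vector lines up with the planted direction, and after projection the activations have essentially no component left along it. Whether a real model's refusal behavior is this low-rank is exactly what the thread is arguing about.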