the_data_nerd 10 hours ago

Right. Removing the refusal head does not put the missing distribution back. Every pass before it (pretraining mix, SFT, RLHF, synthetic data) has already pulled the probability of the charged tokens down. You can jailbreak the gate and still get mild output, because the probability mass was gone ten steps ago.
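
A toy sketch of that point, not a model of any real system. The three-token vocabulary, the per-stage logit shifts, and the `refusal_gate` function are all made up for illustration; the only thing it shows is that taking the gate off gives you back the trained distribution, not the original one.

```python
import math

def softmax(logits):
    """Turn a dict of token -> logit into token -> probability."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def refusal_gate(probs, blocked="charged"):
    """Hypothetical gate: zero the blocked token and renormalize the rest."""
    kept = {tok: (0.0 if tok == blocked else p) for tok, p in probs.items()}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Assumed starting point: all three tokens equally likely.
base_logits = {"charged": 2.0, "mild": 2.0, "neutral": 2.0}

# Assumed (invented) downward shift each stage applies to the charged token.
stage_shifts = {
    "pretraining_mix": -1.0,
    "sft": -1.5,
    "rlhf": -2.0,
    "synthetic_data": -1.0,
}

# Apply every stage's shift before the gate ever sees the distribution.
trained_logits = dict(base_logits)
trained_logits["charged"] += sum(stage_shifts.values())

p_base = softmax(base_logits)["charged"]                     # before any training stage
p_ungated = softmax(trained_logits)["charged"]               # gate removed ("jailbroken")
p_gated = refusal_gate(softmax(trained_logits))["charged"]   # gate in place

print(f"base: {p_base:.4f}  gated: {p_gated:.4f}  ungated: {p_ungated:.4f}")
```

With these invented numbers the charged token starts at one third, sits at zero behind the gate, and comes back at well under 1% once the gate is removed. The gap between the ungated value and the base value is the part no jailbreak touches.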