andai an hour ago
I'm reminded of the emergent misalignment paper, where a model fine-tuned to produce insecure source code would also reliably respond in evil ways to unrelated requests, e.g. you'd ask it for a cookie recipe and it would add poison to the recipe. I understood that as "there was a single 'don't be evil' direction that got inverted," but I'm not sure what it really looks like mechanistically. (Adding obvious exploits to source code does seem analogous to adding poison to a recipe.)