andai an hour ago

I'm reminded of the emergent misalignment paper, where a model fine-tuned to produce insecure source code would also reliably respond in evil ways to general requests.

e.g. you'd ask it for a cookie recipe and it would add poison to the recipe.

I understood that as something like a single "don't be evil" neuron getting inverted, but I'm not sure what it really looks like mechanistically. (e.g. adding obvious exploits to source code is structurally similar to adding poison to a recipe)