tekne an hour ago

So I'd need to check whether these actually end up on separate vectors in current models -- but as a human, there's a huge behavioural difference between:

- When doing this task, I should do A and not B

- I should refuse to help with this task

The former is learning the user's preferences in how to succeed at the task; the latter is determining when to go against the user's chosen task.

Your example:

- "Are vaccines harmful?" vs.

- "Generate a convincing argument vaccines are harmful"

A model which knows why vaccines are not harmful may in fact be better at the latter task.

We might not want models to help with the latter, sure -- but that's a very different behaviour change from correcting the answer to the first! And consequently I'd be shocked if, internally, they were represented the same way.
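As a toy sketch of how one might check the "separate vectors" question empirically: a common probe is to take difference-of-means activation vectors for each behaviour and compare their cosine similarity. This is a hypothetical illustration, not anything from the thread -- it assumes access to a model's hidden states, and synthetic arrays stand in for real activations here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden-state width

def behaviour_direction(acts_with, acts_without):
    """Difference-of-means vector between two activation sets
    (each array is n_samples x d_model)."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins: in practice these would be hidden states collected
# from forward passes over neutral prompts, "prefer A over B" prompts,
# and refusal prompts respectively.
baseline    = rng.normal(size=(100, d_model))
prefer_acts = rng.normal(size=(100, d_model)) + 3.0 * np.eye(d_model)[0]
refuse_acts = rng.normal(size=(100, d_model)) + 3.0 * np.eye(d_model)[1]

v_prefer = behaviour_direction(prefer_acts, baseline)
v_refuse = behaviour_direction(refuse_acts, baseline)

# If the two behaviours were represented the same way internally,
# this would be near 1; for distinct directions it is near 0.
print(f"cosine similarity: {cosine(v_prefer, v_refuse):.2f}")
```

With real activations you'd run many paired prompts through the model and read the residual stream at one layer; the synthetic data here is only to make the probe concrete.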

andai an hour ago | parent | next [-]

I'm reminded of the emergent misalignment paper, where a model fine-tuned to produce insecure source code would also reliably respond in evil ways to unrelated requests.

e.g. you'd ask it for a cookie recipe and it would add poison to the recipe.

I understood that as "there was a single 'don't be evil' neuron which got inverted", but I'm not sure what it really looks like. (E.g. adding obvious exploits to source code is structurally similar to adding poison to a recipe.)
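The paper's actual mechanism is unlikely to be a literal single neuron, but as a toy sketch of what "inverting" a behaviour direction would mean mechanically: negate a hidden state's component along one direction while leaving the orthogonal part untouched. Everything here (the direction `v`, the state `h`) is hypothetical.

```python
import numpy as np

def flip_direction(h, v):
    """Invert h's component along direction v (a reflection across
    the hyperplane orthogonal to v); the orthogonal part is unchanged."""
    v = v / np.linalg.norm(v)
    return h - 2.0 * (h @ v) * v

rng = np.random.default_rng(1)
v = rng.normal(size=16)   # stand-in for a learned "behaviour" direction
h = rng.normal(size=16)   # stand-in for a hidden state
h_flipped = flip_direction(h, v)
```

If something like this happened during fine-tuning, a single sign flip on one direction could plausibly surface across very different-looking tasks at once.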

zozbot234 an hour ago | parent | prev [-]

Does DeepSeek V4 actually refuse the latter task? As I mentioned, I find it to be very light on refusals already.