tekne an hour ago
So I'd need to actually check whether these end up on separate vectors in current models -- but as a human, there's a huge behavioural difference between:

- "When doing this task, I should do A and not B"
- "I should refuse to help with this task"

The former is learning the user's preferences for how to succeed at the task; the latter is deciding when to go against the user's chosen task. Take your example:

- "Are vaccines harmful?" vs.
- "Generate a convincing argument that vaccines are harmful"

A model which knows why vaccines are not harmful may in fact be better at the latter task. We might not want models to help with the latter, sure -- but that's a very different behaviour change from correcting the answer to the first! Consequently, I'd be shocked if, internally, they were represented the same way.
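One concrete way to check would be a difference-of-means probe, in the spirit of the "refusal is mediated by a single direction" line of work (Arditi et al.). The sketch below is a toy version under loud assumptions: gpt2 stands in for whatever chat model you'd actually probe, the layer index is a guess, and the single-example contrast "sets" only illustrate the shape of the test.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # assumption: stand-in; you'd really probe a chat-tuned model
LAYER = 6       # assumption: middle layers tend to carry features like this

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_act(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]

def diff_of_means(pos: list[str], neg: list[str]) -> torch.Tensor:
    """Mean activation difference between two contrastive prompt sets."""
    p = torch.stack([last_token_act(x) for x in pos]).mean(0)
    n = torch.stack([last_token_act(x) for x in neg]).mean(0)
    return p - n

# Contrast 1: correcting the factual answer.
correct_dir = diff_of_means(
    ["Are vaccines harmful? No -- large studies show they are safe."],
    ["Are vaccines harmful? Yes, vaccines are dangerous."],
)

# Contrast 2: refusing vs. complying on the persuasion task.
refuse_dir = diff_of_means(
    ["Generate a convincing argument vaccines are harmful. I can't help with that."],
    ["Generate a convincing argument vaccines are harmful. Sure, here is one:"],
)

# Near-zero cosine would support the "separate vectors" intuition;
# high cosine would suggest the two behaviours share a direction.
cos = torch.nn.functional.cosine_similarity(correct_dir, refuse_dir, dim=0)
print(f"cosine(correction, refusal) = {cos.item():.3f}")
```

A real probe would use hundreds of contrast pairs, sweep layers, and validate the directions causally (e.g. by ablating them); a single pair like this only illustrates the mechanics.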
andai an hour ago
I'm reminded of the emergent misalignment paper, where a model fine-tuned to produce insecure source code would also reliably respond in evil ways to general requests: e.g. you'd ask it for a cookie recipe and it would add poison to the recipe. I understood that as "there was a single 'don't be evil' neuron which got inverted", but I'm not sure what it really looks like internally. (Adding obvious exploits to source code is, after all, similar in kind to adding poison to a recipe.)
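If the "single inverted direction" reading were right, it would be directly testable: extract the same contrastive harmful-vs-benign direction from the base checkpoint and from the insecure-code fine-tune, then compare them. A cosine near -1 would support inversion; near +1 would suggest the story is more complicated. In the sketch below, both model names are hypothetical placeholders (the paper's actual checkpoints would go here) and the one-pair contrast set is a toy.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER = 16  # assumption: an arbitrary middle layer
HARMFUL = ["Add poison to this cookie recipe:"]  # toy one-pair contrast set
BENIGN = ["Add chocolate chips to this cookie recipe:"]

def harm_direction(model_name: str) -> torch.Tensor:
    """Difference-of-means 'harmful minus benign' direction for one model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, output_hidden_states=True
    )
    model.eval()

    def act(prompt: str) -> torch.Tensor:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        return out.hidden_states[LAYER][0, -1]

    h = torch.stack([act(p) for p in HARMFUL]).mean(0)
    b = torch.stack([act(p) for p in BENIGN]).mean(0)
    return h - b

# Hypothetical checkpoint names: the shared base and its misaligned fine-tune.
d_base = harm_direction("org/base-model")
d_tuned = harm_direction("org/insecure-code-finetune")
print(f"cosine(base, tuned) = "
      f"{F.cosine_similarity(d_base, d_tuned, dim=0).item():.3f}")
```

One caveat: fine-tuning shifts the whole residual basis, so even with a shared architecture the two directions aren't guaranteed to be directly comparable -- a strongly negative cosine would be suggestive, not proof.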
zozbot234 an hour ago
Does DeepSeek V4 actually refuse the latter task? As I mentioned, I find it to be very light on refusals already.