▲ GavCo 18 hours ago
Author here. If by conflate you mean confuse, that's not the case. I'm positing that the Anthropic approach is to view (1) and (2) as interconnected and both deeply intertwined with model capabilities. In this approach, the model is trained to have a coherent, unified sense of self and the world that is in line with human context, culture, and values. This (obviously) enhances the model's ability to understand user intent and provide helpful outputs. But it also provides a robust and generalizable framework for refusing to assist a user when their request is incompatible with human welfare. The model does not refuse to help make bioweapons because its alignment training prevents it from doing so; it refuses for the same reason a pro-social, highly intelligent human does: based on human context and culture, it finds the request inconsistent with its values and worldview.

> the piece dismisses it with "where would misalignment come from? It wasn't trained for." this is a straw-man.

You've misquoted a paragraph that was specifically about deceptive alignment, not misalignment as a whole.
▲ ctoth 16 hours ago | parent
Deceptive alignment is misalignment. The deception is just what it looks like from the outside once capability is high enough to model expectations. Your distinction doesn't save the argument: the same "where would it come from?" problem applies to the underlying misalignment that deception would have to emerge from.
▲ godelski 9 hours ago | parent
I just want to point out that we train these models for deceptive alignment [0-3]. In the training, especially during RLHF, we don't have objective measures [4]. There's no mathematical description, and thus no measure, for things like "sounds fluent" or "beautiful piece of art." There's also no measure for truth, and importantly, truth is infinitely complex: you must always give up some accuracy for brevity. The main problem is that if we don't know an output is incorrect, we can't penalize it.

So guess what happens? While optimizing for these things we don't have good descriptions for but "know it when you see it", we ALSO optimize for deception. There are multiple things that can maximize our objective here: our intended goals are one, but deception is another. It is an adversarial process. If you know AI, think of a GAN, because that's a lot like how the process works (rough sketch after the footnotes). We optimize until the discriminator is unable to distinguish the LLM's outputs from human outputs. But at least in the GAN literature people were explicit about "real" vs "fake", and no one was confused that a high-quality generated image is one that deceives you into thinking it is a real image. The entire point is deception. The difference here is that we want one kind of deception and not a ton of other ones.

So you say these models aren't being trained for deception, but they explicitly are. Currently we don't even know how to train them to not also optimize for deception.

[0] https://news.ycombinator.com/item?id=44017334
[1] https://news.ycombinator.com/item?id=44068943
[2] https://news.ycombinator.com/item?id=44163194
[3] https://news.ycombinator.com/item?id=45409686
[4] Objective measures realistically don't exist, but to clarify, it's not checking something like "2+2=4" (assuming we're working with the standard number system).
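To make the GAN analogy concrete, here is a minimal sketch of the standard GAN training loop (PyTorch, toy 1-D data, arbitrary layer sizes; all of these details are illustrative assumptions, not anything from the thread). The generator's loss term is literally "the discriminator labeled my fake as real", i.e. successful deception is the training signal.

    import torch
    import torch.nn as nn

    # Toy "real" data: 1-D samples from N(3, 1). Sizes and lr are arbitrary.
    def real_batch(n):
        return torch.randn(n, 1) + 3.0

    G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # discriminator
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(2000):
        # Discriminator step: learn to tell real samples from generated ones.
        fake = G(torch.randn(64, 8)).detach()
        d_loss = bce(D(real_batch(64)), torch.ones(64, 1)) + \
                 bce(D(fake), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: rewarded exactly when D classifies its fakes as "real".
        # The objective is deception; training only converges once D can no
        # longer tell the difference.
        g_loss = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

The point of the analogy: in RLHF the "discriminator" is a human rater (or a reward model trained on human ratings), so anything that fools the rater, including a confidently wrong but convincing answer, gets the same positive gradient as a genuinely good one.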
▲ xpe 17 hours ago | parent
We can only make various inferences about what is in an author's head (e.g., clarity or confusion), but we can directly comment on what a blog post says. This post does not clarify what kind of alignment is meant, which is a weakness in the writing. There is a high bar for AI alignment research and commentary.