godelski a day ago
A critical part of AI alignment is understanding what goals besides the intended one maximize our training objectives. I think this is something everyone kind of knows and will repeat, but nobody is giving it anywhere near the depth of thought needed to address the problem. Kind of like a cliché: something everyone can recite but frequently fails to implement in practice.

Critically, when discussing intention, I think not enough attention is given to the fact that deception also maximizes RLHF, DPO, and any other human-preference-based optimization (a sketch of these objectives is at the end of this comment). These are quite difficult things to measure, and there's no formal, mathematically derived evaluation. Alignment is incredibly difficult even in settings where measures have strong mathematical bases and we have the means to make high-quality measurements. But here we have neither... We are essentially using the Justice Potter Stewart definition: I know it when I see it[0]. This has been highly successful and has helped us make major strides! I don't want to detract from that in any way. But we do have to recognize that there is a lurking danger that can create major problems. As long as it is based on human preference, well... we sure prefer a lie that doesn't sound like a lie over a lie that is obviously a lie. We obviously prefer truth and accuracy above either, but the notion of truth is fairly ill-defined and we have no formal, immutable definition outside highly constrained settings. It means the models are also being optimized so that their errors are difficult to detect.

This is inherently a dangerous position, even if only because our optimization methods do not preclude the possibility. It may not be happening, but if it is, we may not know. This is the opposite of what is considered good design in every other form of engineering. A lot of time is dedicated to error analysis and design. We specifically design things so that when they fail, or begin to fail, they do so in controllable and easily detectable ways. You don't want your bridges to fail, but when they do, you also don't want them to fail unpredictably. You don't want your code to fail, but when it does, you don't want it leaking memory, spawning new processes, or doing any other wild things. You want it to fail with easy-to-understand error messages. But our current design for AI and ML does not provide such a framework. This is true beyond LLMs.

I'm not saying we should stop, and I'm definitely not a doomer. I think AI and ML do a lot of good and will do much more good in the future[1]. They will also do harm, but I think the rewards outweigh the risks. We should just make sure we're not going into this completely blind, and we should try to minimize the potential for harm. This isn't a call to stop; it's a call for more people to enter the space, and a call for people already in the space to spend more time deeply thinking about these things. There are so many underlying subtleties that they are easy to miss, especially given all the excitement. We're on an edge now, in the public eye, where if our work makes too many mistakes, or too big a mistake, it will risk shutting everything down.

I know many might read me as a party pooper, but actually I want to keep the party going! That also means making sure the party doesn't go overboard. Inviting a monkey with a machine gun sure will make the party legendary, but it's also a lot more likely to get it shut down a lot sooner, with someone getting shot.
So maybe let's just invite the monkey, but not with the machine gun? It won't be as epic, but I'm certain the good times will go on for much longer and we'll have much more fun in the long run. If the physicists could double-check that the atomic bomb wasn't going to destroy the world (something everyone was highly confident would not happen[2]), I think we can do this. The stakes are pretty similar, but the odds of our work doing serious harm are greater.

[0] https://en.wikipedia.org/wiki/Potter_Stewart

[1] I'm an ML researcher myself! I'm passionate about creating these systems. But we need to recognize flaws and limitations if we are to improve them. Ignoring flaws and limits is playing with fire. Maybe you won't burn your house down, maybe you will. But you can't even determine the answer if you won't ask the question.

[2] The story gets hyped, but atmospheric ignition really wasn't believed likely. Despite this, they still double-checked, given the stakes. We could say the same thing about micro black holes at the LHC: the public found out and got scared, physicists thought it was all but impossible, but they ran the calculations anyway. Why take on even that extreme a risk, right?
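For concreteness, here is a minimal sketch of the preference objectives referenced above, written in standard notation (Bradley-Terry reward modeling and the DPO loss); this is an illustration added for clarity, not a claim about any particular lab's training setup:

% Bradley-Terry preference model: probability that humans prefer completion y_w over y_l
% for prompt x, where r(x, y) is a reward model fit purely to human comparison data.
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% DPO loss: optimizes the policy \pi_\theta directly against those human comparisons,
% with \pi_{\mathrm{ref}} a frozen reference policy and \beta a temperature parameter.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \Bigl[ \log \sigma\Bigl( \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                           - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Bigr) \Bigr]

% Nothing in either expression references truth or correctness: the only signal is which
% completion the annotator preferred. If annotators cannot distinguish a polished falsehood
% from a correct answer, the objective rewards both equally, which is the point above about
% errors being optimized to be hard to detect.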
hamburga a day ago | parent
> this is a call for more people to enter the space

Part of my argument in the post is that we are in this space, even those of us who aren’t ML researchers, just by virtue of being part of the selection process that evaluates different AIs and decides when and where to apply them. A bit more on that: https://muldoon.cloud/2023/10/29/ai-commandments.html