Remix.run Logo
craigus 6 days ago

"New science" phooey.

Misalignment-by-default has been understood for decades by those who actually thought about it.

S. Omohundro, 2008: "Abstract. One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted."

https://selfawaresystems.com/wp-content/uploads/2008/01/ai_d...

E. Yudkowsky, 2009: "Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth."

https://www.lesswrong.com/posts/GNnHHmm8EzePmKzPk/value-is-f...

qnleigh 5 days ago | parent | next [-]

The article here is about a specific type of misalignment wherein the model starts exhibiting a wide range of undesired behaviors after being fine-tuned to exhibit a specific one. They are calling this 'emergent misalignment.' It's an empirical science about a specific AI paradigm (LLMs), which didn't exist in 2008. I guess this is just semantics, but to me it seems fair to call this a new science, even if it is a subfield of the broader topic of alignment that these papers pioneered theoretically.

But semantics phooey. It's interesting to read these abstracts and compare the alignment concerns they had in 2008 to where we are now. The sentence following your quote of the first paper reads "We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves." This was a credible concern 17 years ago, and maybe it will be a primary concern in the future. But it doesn't really apply to LLMs in a very interesting way, which is that we somehow managed to get machines that exhibit intelligence without being particularly goal-oriented. I'm not sure many people anticipated this.

MostlyStable 5 days ago | parent [-]

Also, EY specifically replied to these results when they originally came out and said that he wouldn't have predicted them [0] (and that he considered this good news actually)

[0] https://x.com/ESYudkowsky/status/1894453376215388644

justlikereddit 5 days ago | parent | prev [-]

[flagged]

mofeien 5 days ago | parent | next [-]

People like yudkowsky might have polarizing opinions and may not be the easiest to listen to, especially if you disagree with them. Is this your best rebuttal, though?

bigyabai 5 days ago | parent [-]

FWIW, I agree with the parent comment's rebuttal. Simply saying "AI could be bad" is nothing Asimov or Roddenbury didn't figure out themselves.

For Elizer to really deign novelty here, he'd have predicted the reason why this happens at all: training data. Instead he played the Chomsky card and insisted on deeper patterns that don't exist (as well as solutions that don't work). Namedropping Elizer's research as a refutation is weak bordering on disingenuous.

MostlyStable 5 days ago | parent | next [-]

I think there is an important difference between "AI can be bad" and "AI will be bad by default", and I didn't think anyone was making it before. One might disagree but I didn't think one can argue it wasn't a novel contribution.

Also, if your think they had solutions, ones that work or otherwise, then you haven't been paying attention. Half of their point is that we don't have solutions. And we shouldn't be building AI until we do.

Again, I think that reasonable people can disagree with that crowd. But I can't help noticing a pattern where almost everyone who disagrees is almost always misrepresenting their work and what they say.

DennisP 5 days ago | parent | prev | next [-]

Except training data is not the reason. Or at least, not the only reason.

digbybk 5 days ago | parent | prev [-]

What were the deeper patterns that don't exist?

wizzwizz4 5 days ago | parent | prev | next [-]

Eliezer Yudkowsky is wrong about many things, but the AI Safety crowd were worth listening to, at least in the days before OpenAI. Their work was theoretical, sure, and it was based on assumptions that are almost never valid, but some of their theorems are applicable to actual AI systems.

justlikereddit 5 days ago | parent [-]

They were never worth listening to.

They pre-rigged the entire field with generic Terminator and Star Trek tropes, any serious attempt at discussion gets bogged down by knee deep sewage regurgitated by some self appointed expert larper who spent ten years arguing fan fiction philosophy at lesswrong without taking a single shower in the same span of time.

solveit 5 days ago | parent | next [-]

It's frustrating how far you can go out of your way to avoid being associated with such superficially similar tropes and still fail miserably. Yudkowsky in particular hated that he couldn't get a discussion without being typecast as the guy worried about Terminator. He hated it to the point he wrote a whole article on why he thought Terminator tropes were bad (https://www.lesswrong.com/posts/rHBdcHGLJ7KvLJQPk/the-logica...).

As a side note:

> any serious attempt at discussion gets bogged down by [...] without taking a single shower in the same span of time.

This is unnecessary and (somewhat ironically) undermines your own point. I would like to see less of this on HN.

jsnider3 5 days ago | parent | prev [-]

Then it should be easy for you to make an aligned AI, right? Can I see it?

wizzwizz4 4 days ago | parent [-]

Aligned AI is easy. https://en.wikipedia.org/wiki/Expert_system

The hard part is extrapolated alignment, and I don't think there's a good solution to this. Large groups of humans are good at this, eventually (even if they tend to ignore their findings about morality for hundreds, or thousands, of years, even past the point where over half the local population knows, understands, and believes those findings), but individual humans are pretty bad at moral philosophy. (Simone Weil was one of the better ones, but even she thought it was more important to Do Important Stuff (i.e., get in the way of more competent resistance fighters) than to act in a supporting role.)

Of course, the Less Wrongians have extremely flawed ideas about extrapolated alignment (e.g. Eliezer Yudkowsky thinks that "coherent extrapolated volition" is a coherent concept that one might be able to implement, given incredible magical powers), and OpenAI's twisted parody of their ideas is even worse. But it's thanks to the Less Wrongians' writings that I know their ideas are flawed (and that OpenAI's marketing copy is cynical lies / cult propaganda). "Coherent extrapolated volition" is the kind of idea I would've come up with myself, eventually, and (unlike Eliezer Yudkowsky, who identified some flaws almost immediately) I would probably have become too enamoured with it to have any sensible thoughts afterwards. Perhaps the difficulty (impossibility) of actually trying to build the thing would've snapped me out of it, but I really don't know.

Anyway: extrapolated alignment is out (for now, and perhaps forever). But it's easy enough to make a "do what I mean" machine that augments human intelligence, if you can say all the things it's supposed to do. And that accounts for the majority of what we need AI systems to do: for most of what people use ChatGPT for nowadays, we already had expert systems that do a vastly better job (they just weren't collected together into one toolsuite).

achierius 4 days ago | parent [-]

Ok, sorry, rephrase: a useful aligned AI.

wizzwizz4 4 days ago | parent [-]

Expert systems are plenty useful. For example, content moderation: an expert system can interpret and handle the common cases, leaving only the tricky cases for humans to deal with. (It takes a bit of thought to come up with the rules, but after the dozenth handling of the same issue, you've probably got a decent understanding of what it is that is the same – perhaps good enough to teach to the computer.)

Expert systems let you "do things that don't scale", at scale, without any loss of accuracy, and that is simply magical. They don't have initiative, and can't make their own decisions, but is it ever useful for a computer to make decisions? They cannot be held accountable, so I think we shouldn't be letting them, even before considering questions of competence.

bondarchuk 5 days ago | parent | prev [-]

Yudkowsky Derangement Syndrome...