| ▲ | tony_cannistra 3 hours ago |
| > Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date. How can these claims all be true at once? Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution. https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89... |
|
| ▲ | Zee2 an hour ago | parent | next [-] |
| Alignment “appearing” better as model capabilities increase scares the shit out of me, tbh. |
|
| ▲ | goekjclo an hour ago | parent | prev | next [-] |
| I don't know if they can be any more 'cautious' for Mythos 2... |
|
| ▲ | tekacs 2 hours ago | parent | prev | next [-] |
| "We want to see risks in the models, so no matter how good the performance and alignment, we’ll see risks, results and reality be damned." |
| ▲ | randomcatuser an hour ago | parent [-] | | I mean, to be fair, these are professional researchers. I'm very inclined to trust them on the various ways that models can subtly go wrong in long-term scenarios. For example, consider using models to write email: is it a misalignment problem if the model is just too good at writing marketing emails? Or too good at getting people to pay a spammy company? Another hot use case: biohacking. If a model is used to do really hardcore synthetic chemistry, one might not realize that it's potentially harmful until too late (i.e., the human is splitting up a problem so that no guardrails are triggered). |
|
|
| ▲ | CamperBob2 an hour ago | parent | prev [-] |
| Translation: yay, more paternalism. |
| ▲ | kay_o an hour ago | parent [-] | | Anthropic always goes on and on about how their models are world-changing and super dangerous. Every single time they make something new, they say it's going to rewrite everything and be scary, lmao. Funny, because they do it every time like clockwork, acting like their AI is a thunderstorm coming to wipe out the world. | | |
| ▲ | wolttam 35 minutes ago | parent [-] | | If there are advancements, they have to be described somehow. What if the capability advancements are real and they warrant a higher level of concern or attention? Are we just going to dismiss them automatically because "bro, you're blowing it up too much"? Either way, these improvements to capabilities are ratcheting along at about the pace that many people were expecting (and were right to expect). There is no apparent reason they will stop ratcheting along any time soon. The rational approach is probably to start behaving as if models as capable as Anthropic says this one is actually exist (even if you don't believe them on this one). The capabilities will eventually arrive, most likely sooner than we all think, and you don't want to be caught with your pants down. | | |
| ▲ | kay_o 30 minutes ago | parent [-] | | I believe the advancements, sure. But it's a very boy-who-cried-wolf situation for some of these. There are other companies that behave less in this way; Anthropic seems unique in that they love making every single release sound like a world-ender. |
|
|
|