andy12_ 10 hours ago

There is also the slight problem that Opus 4.6 apparently verbalized awareness of being in some sort of simulation during some evaluations[1], so we can't be sure whether Opus is actually misaligned or just good at playing along.

> On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5. However, this result is confounded by additional internal and external analysis suggesting that Claude Opus 4.6 is often able to distinguish evaluations from real-world deployment, even when this awareness is not verbalized.

[1] https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea...