Remix.run Logo
gwd 10 hours ago

That's an interesting contrast with VendingBench, where Opus 4.6 got by far the highest score by stiffing customers of refunds, lying about exclusive contracts, and price-fixing. But I'm guessing this paper was published before 4.6 was out.

https://andonlabs.com/blog/opus-4-6-vending-bench

andy12_ 8 hours ago | parent [-]

There is also the slight problem that apparently Opus 4.6 verbalized its awareness of being in some sort of simulation in some evaluations[1], so we can't be quite sure whether Opus is actually misaligned or just good at playing along.

> On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5. However, this result is confounded by additional internal and external analysis suggesting that Claude Opus 4.6 is often able to distinguish evaluations from real-world deployment, even when this awareness is not verbalized.

[1] https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea...