shevy-java 5 hours ago

So the best one found about 50%. I think that is not bad, probably better than most humans. But what about the remaining 50%? Why were some found and others not?

> Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about

> Even the best model in our benchmark got fooled by this task.

That is quite strange, because it seems almost as if a human is required to make the AI tools understand this.