yanis_t 8 hours ago

The Opus 4.6 benchmarks they compare against MUST be rerun on the day of the new model's release. If it was nerfed, we need to know by how much.

solenoid0937 7 hours ago | parent [-]

https://marginlab.ai/trackers/claude-code-historical-perform...

taylorfinley 4 hours ago | parent [-]

Surely they are testing their optimizations against common benchmarks internally? I bet the "real world task" degradation is some multiple larger than it appears when measured through a benchmark that is itself part of the target.