scellus 3 hours ago
I like Opus 4.5 a lot, but a general comment on benchmarks: the number of subtasks or problems in each one is finite, and many of the benchmarks are saturating, so the effective number of problems at the frontier is even smaller. If you think of the generalizable capability of the model as a latent feature to be measured by benchmarks, we therefore have only rather noisy estimates, and people read too much into small differences in the numbers. It's best to aggregate across many benchmarks: Epoch has its Capabilities Index, Artificial Analysis is doing something similar, and there are probably others I don't know or remember.

And then there's the part of models that is hard to measure. Opus has some sort of HAL-like smoothness I don't see in other models, though I haven't tried gpt-5.2 for coding yet. (Nor Gemini 3 Pro; I'm not claiming superiority of Opus, just that something in practical usability is hard to measure.)
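
To put rough numbers on the noise argument: here's a back-of-the-envelope sketch under a simple binomial model, assuming independent pass/fail outcomes per problem and made-up problem counts and pass rates (correlated problems and prompt sensitivity in real benchmarks would make the noise larger, not smaller).

    # How noisy is a benchmark score with N effective problems?
    # Assumes independent pass/fail per problem (an idealization).
    import math

    def score_std_error(pass_rate, n_problems):
        # Standard error of a measured pass rate under a binomial model.
        return math.sqrt(pass_rate * (1 - pass_rate) / n_problems)

    # Hypothetical saturating benchmark: 500 problems nominally,
    # but only ~150 still discriminate between frontier models.
    for n in (500, 150):
        se = score_std_error(0.80, n)
        print(f"N={n}: 80.0% +/- {100 * se:.1f} pp (1 std. error)")

    # N=500: ~1.8 pp; N=150: ~3.3 pp -- so a 2-point gap between two
    # frontier models is well within noise on the smaller effective set.

That's only sampling noise on one run of one benchmark, which is part of why aggregating across many of them is the sane thing to do.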