Spivak | 12 hours ago
I don't think it's a valid measure across models but, as in the OP, it's a great measure for when they mess with "the same model" behind the scenes. That being said we also do keep a test suite to check that model updates don't result in worse results for our users and it worked well enough. We had to skip a few versions of Sonnet because it stopped being able to complete tasks (on the same data) it could previously. I don't blame Anthropic, I would be crazy to assume that new models are a strict improvement across all tasks and domains. I do just wish they would stop depreciating old models, once you have something working to your satisfaction it would be nice to freeze it. Ah well, only for local models. |