prmph 8 hours ago
So many things to think about regarding these "benchmarks":

- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?

- Would Anthropic (or any other model vendor, for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?

- Would it be more useful to move toward a comparative rather than an absolute ranking?