cromka 2 hours ago
That's exactly why there's a ton of different benchmarking suites for evaluating hardware performance. I reckon we'll end up with similar suites comparing different aspects of models. And at some point we'll be dealing with models skewing results whenever they detect they're being benchmarked, as happened before with hardware. Some say that's already happening with the pelican test.