mrandish 4 hours ago
> the token counts to achieve these results

I've also been increasingly curious about better metrics to objectively assess relative model progress. In addition to the decreasing ability of standardized benchmarks to identify meaningful differences in the real-world utility of output, it's getting harder to hold input variables constant for apples-to-apples comparison. Knowing which model scores higher on a composite of diverse benchmarks isn't useful without adjusting for GPU usage, energy, speed, cost, etc.