oceanplexian 6 hours ago

> Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.

I'm not disagreeing per se, but if you think the benchmarks are flawed and "my real world usage" is more reflective of model capabilities, why not write some benchmarks of your own?

You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability. Maybe the frontier labs would hire you.