vessenes 3 hours ago

Cheapest just isn't a very useful metric. Can I suggest a Pareto-curve type representation? Cost / request vs ELO would be useful and you have all the data.

skysniper 2 hours ago | parent

TBH that was my initial thought too, but I ran into a problem with that approach:

Essentially I'm using the relative ranks from each battle to fit a latent strength for each model, then mapping that latent strength to Elo through a nonlinear function purely for human readability. The mapping function is essentially arbitrary: any monotonically increasing function works, since it preserves the rank. The only reliable result (one that is invariant to the choice of function) is the relative rank of the models.
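A small sketch of that invariance (the latent strengths and the two mapping functions here are made up for illustration, not the site's actual fit):

```python
import math

# Hypothetical latent strengths fitted from battle ranks.
latent = {"model_a": 1.8, "model_b": 0.4, "model_c": -0.9}

# Two different monotonically increasing maps to an "Elo-like" scale.
elo_linear = {m: 1000 + 400 * s for m, s in latent.items()}
elo_exp = {m: 1000 + 100 * math.exp(s) for m, s in latent.items()}

def rank(scores):
    # Models ordered best-to-worst by score.
    return sorted(scores, key=scores.get, reverse=True)

# The scores differ, but the ranking is identical under any monotone map.
print(rank(elo_linear))  # ['model_a', 'model_b', 'model_c']
print(rank(elo_exp))     # ['model_a', 'model_b', 'model_c']
```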

That said, if I used score/cost as the metric, the ranking would depend entirely on the mapping function I chose: a more super-linear function pushes high-performance models up the score/cost board, while a more sub-linear one favors low-performance models.

That's why I eventually settled on the current approach: have the judge rank models directly by cost-effectiveness (considering both performance and cost), and compute the cost-effectiveness leaderboard from those ranks, so the score-mapping function doesn't affect the leaderboard at all.
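One simple way to aggregate such judge-given rankings into a leaderboard is average rank position; this is only an illustrative aggregation with invented battle data, not necessarily the method the site uses:

```python
from collections import defaultdict

# Hypothetical judge verdicts: each battle lists models ordered
# best-to-worst by cost-effectiveness, as ranked directly by the judge.
battles = [
    ["small_model", "big_model", "mid_model"],
    ["small_model", "mid_model", "big_model"],
    ["mid_model", "small_model", "big_model"],
]

# Collect each model's rank position (0 = best) across battles.
positions = defaultdict(list)
for battle in battles:
    for pos, model in enumerate(battle):
        positions[model].append(pos)

# Leaderboard by average position; no score-mapping function involved.
leaderboard = sorted(positions,
                     key=lambda m: sum(positions[m]) / len(positions[m]))
print(leaderboard)  # ['small_model', 'mid_model', 'big_model']
```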