johndough 2 hours ago

I would have liked aggregated results instead. Expanding 300 tables is a bit tiresome. But I guess that is easy with AI now. Here is a scatter plot of quality vs duration

https://i.imgur.com/wFVSpS5.png

and quality vs cost

https://i.imgur.com/fqM4edw.png

But I just noticed that my plot is meaningless because it conflates model quality with provider uptime.

Claude Haiku has a higher average quality than Claude Opus, which does not make sense. The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.

skysniper an hour ago | parent [-]

> The explanation is that network errors were credited with a quality score of 0, and there were _a lot_ of network errors.

All network errors, provider errors, and openclaw errors are actually excluded from the ranking calculation, so that is not the reason.

Real reason:

The absolute score is not consistent across tasks and cannot be directly added or averaged, for either human or LLM judges. But the relative rank is stable (model A is better than model B). That is exactly why Chatbot Arena only uses the relative rank of models in each battle in the first place, and why we follow that approach.
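To make the "relative rank per battle" idea concrete, here is a minimal sketch of Elo updating from pairwise outcomes, in the style Chatbot Arena-style leaderboards use. The K factor of 32 and starting rating of 1000 are illustrative assumptions, not the site's actual parameters:

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Update both ratings in place after one battle."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

ratings = {"haiku": 1000.0, "opus": 1000.0}
update_elo(ratings, "opus", "haiku")  # Opus wins a battle
update_elo(ratings, "opus", "haiku")  # and another
# ratings["opus"] is now above ratings["haiku"]
```

Note that only the win/loss outcome of each battle enters the update; the absolute 0-10 scores never do, which is what makes the aggregate robust to per-task score inflation.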

A concrete example of why scores across tasks cannot be added or averaged directly: people tend to try Haiku on easier tasks and compare it with T2 models, while they try Opus on harder tasks and compare it with better models.

Another example: judges (human or LLM) tend to adjust scores based on the opponents. Sonnet might get 10/10 if all other opponents are Haiku-level, but only 8/10 if the opponents include Opus or gpt-5.4.
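The first confound above can be sketched with a toy calculation (all numbers invented): if Haiku is scored on easy tasks and Opus on hard ones, naive averaging flips the true ordering, while a head-to-head comparison on the same task does not.

```python
avg = lambda xs: sum(xs) / len(xs)

haiku_scores = [9, 9, 8]  # easy tasks, generous scoring
opus_scores = [7, 8, 7]   # hard tasks, harsh scoring

# Naive averaging: Haiku "beats" Opus.
print(avg(haiku_scores) > avg(opus_scores))  # True

# Head-to-head on the *same* task, the relative rank is stable.
same_task = {"haiku": 6, "opus": 9}
print(same_task["opus"] > same_task["haiku"])  # True
```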

So if you want to make the plot, you should plot the Elo score (from the leaderboard) against the average cost per task. But note: the average cost has a similar issue. People naturally use smaller models for simpler tasks, so a smaller model's lower cost comes from two factors: lower unit cost and simpler tasks.
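A sketch of that suggested plot, with purely made-up numbers standing in for the real leaderboard Elo and per-task cost data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# name: (avg cost per task in USD, Elo) -- illustrative values only
models = {
    "haiku": (0.02, 1050),
    "sonnet": (0.10, 1150),
    "opus": (0.45, 1230),
}

costs = [c for c, _ in models.values()]
elos = [e for _, e in models.values()]
plt.scatter(costs, elos)
for name, (c, e) in models.items():
    plt.annotate(name, (c, e))
plt.xscale("log")  # cost spans orders of magnitude
plt.xlabel("average cost per task (USD)")
plt.ylabel("Elo score")
plt.savefig("elo_vs_cost.png")
```

Even then, keep the caveat above in mind: the x-axis mixes unit price with task difficulty.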

The methodology page contains more details if you are interested.

johndough an hour ago | parent [-]

I agree. If humans are allowed to pick the models, there will be an inherent bias. This would be much easier if the models were randomized.